Abstract

We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted
Markov decision processes with finite spaces. Such algorithms were recently proposed by Sutton,
Mahmood, and White (2015) as an improved solution to the problem of divergence of off-policy
temporal-difference learning with linear function approximation. We present in this paper the first
convergence proofs for two emphatic algorithms, ETD($\lambda$) and ELSTD($\lambda$). We prove, under general
off-policy conditions, the convergence in $L^1$ for ELSTD($\lambda$) iterates, and the almost sure
convergence of the approximate value functions calculated by both algorithms using a single infinitely
long trajectory. Our analysis involves new techniques with applications beyond emphatic
algorithms leading, for example, to the first proof that standard TD($\lambda$) also converges under off-policy
training for $\lambda$ sufficiently large.


Keywords: Markov decision processes; approximate policy evaluation; reinforcement learning;

A version of this paper appeared at the 28th Annual Conference on Learning Theory (COLT), Paris, France, 2015.

This version corrects an oversight in a proof in the first version: the original Prop. C.1 needs to be proved based on
the proof of the first part of Prop. C.2. Corrections are now made in the statement of Prop. C.1, the last paragraph of
the proof of Prop. C.2, and the references to these two propositions in Appendix C and Section 2.2. The conclusions
of the paper have not been affected.

1. Introduction

We consider discounted finite-spaces Markov decision processes (MDPs) and the problem of learning
an approximate value function for a given policy from off-policy data, that is, from data due to
a different policy. The first policy is called the target policy and the second is called the behavior 
policy. For example, one may want to learn value functions for many target policies in parallel from 
one (exploratory) behavior; this requires off-policy learning. 

We focus on temporal-difference (TD) methods with linear function approximation (Sutton, 
1988). Such methods are typically convergent when the target and behavior policies are the same 
(the on-policy case), but not in the off-policy case (Tsitsiklis and Van Roy, 1997). This difficulty is 
intrinsic to sampling states according to an arbitrary policy.^1 Gradient-based or least squares-based
approaches have been used to avoid this difficulty.^2

Recently, Sutton, Mahmood, and White (2015) proposed a new approach to address this issue
more directly. They introduced an emphatic TD($\lambda$) algorithm, or ETD($\lambda$) as we call it here. The
approach is related to the early work on episodic off-policy TD($\lambda$) (Precup et al., 2001), and is
based on the idea of re-weighting the states when forming the eligibility traces in TD($\lambda$), so that the
weights reflect the occupation frequencies of the target policy rather than the behavior policy. The
result of this weighting scheme is that the “mean updates” associated with ETD($\lambda$) now involve a
negative definite matrix, similar to the convergent on-policy TD algorithms. This is a salient feature
of the emphatic TD method.

The purpose of this paper is to investigate the convergence properties of ETD($\lambda$) and its
least-squares version, ELSTD($\lambda$). Under general conditions, we show that (see Theorems 2.1, 2.2):

(i) for stepsizes decreasing as $t^{-c}$, $c \in (1/2, 1]$, the matrix and vector iterates generated by
ELSTD($\lambda$) converge in $L^1$ to the desired limits, which define a projected Bellman equation;

(ii) for stepsizes decreasing as $t^{-1}$, both algorithms generate approximate value functions that
converge almost surely to the desired solution of an associated projected Bellman equation.

These results show that the new emphatic TD algorithms are sound for off-policy learning.

Regarding proof techniques, we note that although the “mean updates” of ETD($\lambda$) involve a
negative definite matrix, it is still difficult to directly apply results from stochastic approximation
theory to establish rigorously the association between the “mean updates” and the ETD($\lambda$) iterates,
thereby obtaining the desired convergence. The stability criterion of (Borkar and Meyn, 2000) (see
also (Borkar, 2008, Chap. 3)) and the “natural averaging” argument in (Borkar, 2008, Chap. 6) seem
suitable, but they require a certain tightness condition that is hard to verify in the general off-policy
learning setting where the variances of the trace iterates can grow to infinity with time.^3 The analysis
of (Tsitsiklis and Van Roy, 1997) has a strong condition (Condition (6), p. 683, in particular), which
is difficult to satisfy unless the trace iterates are uniformly bounded. But in general, this would
impose a strong restriction on the behavior policy (cf. Yu, 2012, Prop. 3.1, Footnote 3, and the
discussion in p. 3320-3322).

For regular off-policy LSTD($\lambda$) and TD($\lambda$) (Bertsekas and Yu, 2009), it has been shown by Yu
(2012) that the associated joint process of states and trace iterates exhibits useful properties, by which
convergence results for LSTD($\lambda$) can be derived. Subsequently, the results can be used to furnish
the conditions of a convergence theorem from stochastic approximation theory (Kushner and Yin,
2003) and yield convergence results for TD($\lambda$). In this paper we will take the proof approach used
in (Yu, 2012). We note, however, that most of the intermediate results needed in our case require
different and more involved proofs, due to the complexity of the emphatic TD method. Furthermore,
we will give a new argument to prove the almost sure convergence of ETD($\lambda$), which applies also
to the regular off-policy TD($\lambda$) of (Bertsekas and Yu, 2009) for $\lambda$ near 1. This improves a result of
(Yu, 2012), which only dealt with a constrained version of TD($\lambda$) that restricts the iterates to lie in
a bounded set.

1. See the papers (Baird, 1995; Tsitsiklis and Van Roy, 1997; Sutton et al., 2015) and the books (Bertsekas and Tsitsiklis,
1996; Sutton and Barto, 1998) for related examples and discussion.

2. See e.g., (Maei, 2011; Bertsekas and Yu, 2009; Geist and Scherrer, 2014; Dann et al., 2014).

3. Related examples can be found in (Glynn and Iglehart, 1989; Randhawa and Juneja, 2004; Sutton et al., 2015).

This paper is organized as follows. In Section 2 we formulate the approximate policy evaluation
problem, and we describe the ETD($\lambda$) and ELSTD($\lambda$) algorithms, and the approximate Bellman
equations they aim to solve. We also state our main convergence results in this section. In Section 3
we prove our convergence theorem for ELSTD($\lambda$), and prepare results needed for analyzing ETD($\lambda$)
with a “mean ODE”^4 method. In Section 4 we prove our convergence theorem for ETD($\lambda$). We
collect long proofs, technical lemmas and other related results in Appendices A-C.


2. Emphatic TD Algorithms: ETD($\lambda$) and ELSTD($\lambda$)

2.1. A Policy Evaluation Problem in Off-Policy Learning 

Let $\mathcal{S} = \{1, \ldots, N\}$ be the state space, and let $\mathcal{A}$ be a finite set of actions. We assume, without loss
of generality, that for every state, all actions are feasible. If we take action $a \in \mathcal{A}$ at state $s \in \mathcal{S}$, the
system moves from state $s$ to state $s'$ with probability $p(s' \mid s, a)$, and we receive a random reward
with mean $r(s, a, s')$ and bounded variance, according to a probability distribution $q(\cdot \mid s, a, s')$.

We are interested in evaluating the performance of a given stationary policy^5 $\pi$, the target policy,
without knowledge of the MDP model. The evaluation is to be done by using just observations of
state transitions and rewards, while following a stationary policy $\pi^o \neq \pi$, the behavior policy.

Starting from time $t = 0$, applying $\pi$ would generate a sequence of rewards $R_0, R_1, \ldots$. The
performance of $\pi$ will be measured in terms of the expected total rewards attained under $\pi$ up to
a random termination time $\tau \ge 1$ that depends on the states in a Markovian way. In particular, if
at time $t \ge 1$, the state is $s$ and termination has not occurred yet, then the probability of $\tau = t$
(terminating at time $t$) is $1 - \gamma(s)$, for a given parameter $\gamma(s) \in [0, 1]$.

Let $P_\pi$ denote the transition matrix of the Markov chain on $\mathcal{S}$ induced by $\pi$. Let $\Gamma$ denote the
$N \times N$ diagonal matrix with diagonal entries $\gamma(s)$, $s \in \mathcal{S}$. Denote by $\pi(a \mid s)$ and $\pi^o(a \mid s)$ the
probability of taking action $a$ at state $s$ under the policy $\pi$ and $\pi^o$, respectively.


Assumption 2.1 (Conditions on the target and behavior policies)

(i) The target policy $\pi$ is such that $(I - P_\pi \Gamma)^{-1}$ exists (equivalently, termination occurs with
probability 1 under $\pi$, for any initial state).

(ii) The behavior policy $\pi^o$ induces an irreducible Markov chain on $\mathcal{S}$, and moreover, for all
$(s, a) \in \mathcal{S} \times \mathcal{A}$, $\pi^o(a \mid s) > 0$ if $\pi(a \mid s) > 0$.


4. ODE stands for ordinary differential equation.

5. A stationary policy is a decision rule that specifies the probability of taking action $a$ at state $s$ for every $s \in \mathcal{S}$.

Under Assumption 2.1(i), we define the value function of the target policy $\pi$ by

$$v_\pi(s) = \mathbb{E}^\pi\Big[ \sum_{t=0}^{\tau-1} R_t \,\Big|\, S_0 = s \Big], \qquad s \in \mathcal{S},$$

where $\mathbb{E}^\pi$ denotes expectation with respect to the probability
distribution of the process of states, actions and rewards, $(S_t, A_t, R_t)$, $t \ge 0$, induced by the policy
$\pi$. Let $r_\pi$ be the expected one-stage reward function under $\pi$; i.e., $r_\pi(s) = \mathbb{E}^\pi[R_0 \mid S_0 = s]$ for
$s \in \mathcal{S}$. Then the desired function $v_\pi$ can be seen to satisfy uniquely the Bellman equation^6

$$v_\pi = r_\pi + P_\pi \Gamma\, v_\pi, \qquad \text{i.e.,} \qquad (I - P_\pi \Gamma)\, v_\pi = r_\pi.$$
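As a small numerical illustration of the Bellman equation above, the following sketch computes $v_\pi$ by fixed-point iteration on $v \leftarrow r_\pi + P_\pi \Gamma v$. The two-state chain, its rewards, and the function name `solve_bellman` are hypothetical choices made only for this example; the iteration converges here because $P_\pi \Gamma$ is a strict contraction for the chosen numbers.

```python
def solve_bellman(P, gamma, r, iters=2000):
    """Iterate v <- r + P * Gamma * v, where Gamma = diag(gamma).

    P[s][s2] is the transition probability from s to s2 under the target
    policy, gamma[s2] is the continuation probability at s2, and r[s] is
    the expected one-stage reward at s.
    """
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s] + sum(P[s][s2] * gamma[s2] * v[s2] for s2 in range(n))
             for s in range(n)]
    return v


# Hypothetical two-state example (not from the paper).
P = [[0.5, 0.5],
     [0.0, 1.0]]
gamma = [0.9, 0.0]   # entering state 1 always triggers termination
r = [1.0, 2.0]

v = solve_bellman(P, gamma, r)
```

For these numbers the equation can also be solved by hand: $v(1) = 2$ for the second state, and $v(0) = 1 + 0.45\, v(0)$, so $v(0) = 1/0.55$ for the first.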


2.2. Algorithms 

We consider computing $v_\pi$ with the ETD($\lambda$) algorithm (Sutton et al., 2015) and its least-squares
version, ELSTD($\lambda$), using linear function approximation, while following the behavior policy $\pi^o$.
Let $E \subset \mathbb{R}^N$ be the approximation subspace of dimension $n$, and let $\Phi$ be an $N \times n$ matrix whose
columns form a basis of $E$. The approximation problem is to find a parameter vector $\theta \in \mathbb{R}^n$ such
that $v = \Phi\theta \in E$ approximates $v_\pi$ well.

We express $v = \Phi\theta$ as $v(s) = \phi(s)^\top \theta$, $s \in \mathcal{S}$, where the superscript $\top$ stands for transpose,
and $\phi(s) \in \mathbb{R}^n$ is the transposed $s$-th row of $\Phi$ and represents the “features” of state $s$. Like
standard TD($\lambda$), if a transition $(s, s')$ occurs with reward $r'$, ETD($\lambda$) and ELSTD($\lambda$) use the “temporal
difference” term, $r' + \gamma(s')\phi(s')^\top \theta - \phi(s)^\top \theta$, to adjust the parameter $\theta$ for the approximate value
function. Also like standard TD($\lambda$), these algorithms aim to solve a projected (single-step or
multi-step) Bellman equation; but we shall defer the discussion of this until after describing the ETD($\lambda$)
algorithm.

We focus on a general form of the ETD($\lambda$) algorithm, which uses state-dependent $\lambda$ values
specified by a function $\lambda : \mathcal{S} \to [0, 1]$. Inputs to the algorithm are the states, actions and rewards,
$\{(S_t, A_t, R_t), t \ge 0\}$, generated under the behavior policy $\pi^o$, where $R_t$ is the random reward
received upon the transition from state $S_t$ to $S_{t+1}$ with action $A_t$. The algorithm can access the
following functions, in addition to the features $\phi(s)$:

(i) $\gamma : \mathcal{S} \to [0, 1]$, which specifies the termination probabilities (or equivalently, the
state-dependent discount factors) that define $v_\pi$, as described earlier;

(ii) $\lambda : \mathcal{S} \to [0, 1]$, which determines the single or multi-step Bellman equation for the algorithm
[cf. the subsequent Eqs. (2.5)-(2.6)];

(iii) $\rho : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ given by $\rho(s, a) = \pi(a \mid s)/\pi^o(a \mid s)$ (with $0/0 = 0$), which gives the
likelihood ratios for action probabilities that can be used to compensate for sampling states
and actions according to the behavior policy $\pi^o$ instead of the target policy $\pi$;

(iv) $i : \mathcal{S} \to \mathbb{R}_+$, which gives the algorithm additional flexibility to weigh states according to the
degree of “interest” indicated by $i(s)$.
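To make item (iii) concrete, here is a minimal sketch of the likelihood ratio with the convention $0/0 = 0$. The table-based policy representation and the name `likelihood_ratio` are assumptions for illustration; the error branch corresponds to a violation of the coverage requirement in Assumption 2.1(ii).

```python
def likelihood_ratio(pi, pi_o, s, a):
    """rho(s, a) = pi(a|s) / pi_o(a|s), with the convention 0/0 = 0.

    pi[s][a] and pi_o[s][a] are action probabilities of the target and
    behavior policies (a hypothetical tabular representation).
    """
    num, den = pi[s][a], pi_o[s][a]
    if den == 0.0:
        if num == 0.0:
            return 0.0
        # Assumption 2.1(ii) requires pi_o(a|s) > 0 whenever pi(a|s) > 0.
        raise ValueError("behavior policy does not cover the target policy")
    return num / den


# Hypothetical policies over two states and two actions.
target = [[1.0, 0.0], [0.5, 0.5]]
behavior = [[0.5, 0.5], [0.25, 0.75]]

rho_00 = likelihood_ratio(target, behavior, 0, 0)   # 1.0 / 0.5
rho_01 = likelihood_ratio(target, behavior, 0, 1)   # 0.0 / 0.5
rho_10 = likelihood_ratio(target, behavior, 1, 0)   # 0.5 / 0.25
```

Actions that the target policy never takes get ratio 0, so transitions generated by such actions contribute nothing to the updates.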

6. One can verify this Bellman equation directly. It also follows from the standard MDP theory (see e.g., Puterman,
1994), as by definition $v_\pi$ here can be related to a value function in a discounted MDP where the discount factors
depend on state transitions, similar to discounted semi-Markov decision processes.

The ETD($\lambda$) algorithm does the following. For each $t \ge 0$, let $\alpha_t \in (0, 1]$ be a stepsize
parameter, and to simplify notation, let

$$\rho_t = \rho(S_t, A_t), \qquad \gamma_t = \gamma(S_t), \qquad \lambda_t = \lambda(S_t).$$

ETD($\lambda$) calculates recursively $\theta_t \in \mathbb{R}^n$, $t \ge 0$, according to

$$\theta_{t+1} = \theta_t + \alpha_t\, e_t \cdot \rho_t \big( R_t + \gamma_{t+1}\, \phi(S_{t+1})^\top \theta_t - \phi(S_t)^\top \theta_t \big), \tag{2.1}$$

where $e_t \in \mathbb{R}^n$ (called the “eligibility trace”) is calculated together with two nonnegative scalar
iterates $(F_t, M_t)$ according to:^7

$$F_t = \gamma_t\, \rho_{t-1}\, F_{t-1} + i(S_t), \tag{2.2}$$

$$M_t = \lambda_t\, i(S_t) + (1 - \lambda_t)\, F_t, \tag{2.3}$$

$$e_t = \lambda_t\, \gamma_t\, \rho_{t-1}\, e_{t-1} + M_t\, \phi(S_t). \tag{2.4}$$

For $t = 0$, $(e_0, F_0, \theta_0)$ are given as an initial condition of the algorithm.
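The recursions (2.1)-(2.4) can be sketched in a few lines of plain Python. This is only an illustrative transcription under stated assumptions: vectors are lists, the functions $\gamma$, $\lambda$, $i$ and the features $\phi$ are stored as tables indexed by state, and the helper names (`dot`, `update_traces`, `update_theta`) and the two-state data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(u, v))


def update_traces(e_prev, F_prev, rho_prev, s, phi, gamma, lam, interest):
    """Eqs. (2.2)-(2.4): traces at time t from those at t-1, given S_t = s."""
    F = gamma[s] * rho_prev * F_prev + interest[s]            # (2.2)
    M = lam[s] * interest[s] + (1.0 - lam[s]) * F             # (2.3)
    e = [lam[s] * gamma[s] * rho_prev * ep + M * x            # (2.4)
         for ep, x in zip(e_prev, phi[s])]
    return e, F, M


def update_theta(theta, e, rho, reward, s, s_next, phi, gamma, alpha):
    """Eq. (2.1): one parameter update for the transition (s, s_next)."""
    delta = reward + gamma[s_next] * dot(phi[s_next], theta) - dot(phi[s], theta)
    return [th + alpha * rho * delta * ei for th, ei in zip(theta, e)]


# Hypothetical two-state problem with one-hot features (illustration only).
phi = [[1.0, 0.0], [0.0, 1.0]]
gamma = [0.9, 0.5]
lam = [0.0, 1.0]
interest = [1.0, 1.0]

# One step: previous traces (e, F) = ([1, 0], 1), rho_{t-1} = 2, S_t = 1.
e1, F1, M1 = update_traces([1.0, 0.0], 1.0, 2.0, 1, phi, gamma, lam, interest)
# Then a transition 1 -> 0 with reward 1, rho_t = 1 and stepsize 0.5.
theta1 = update_theta([0.0, 0.0], e1, 1.0, 1.0, 1, 0, phi, gamma, 0.5)
```

Note the timing: the traces at time $t$ use $\rho_{t-1}$ and the functions evaluated at $S_t$, whereas the parameter update uses $\rho_t$ and $\gamma_{t+1}$.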

We recognize that the iteration (2.1) has the same form as standard TD, but the trace $e_t$ is
calculated differently, involving an “emphasis” weight $M_t$ on the state $S_t$, which itself evolves
along with the iterate $F_t$, called the “follow-on” trace. If $M_t$ is always set to 1 regardless of $F_t$ and
$i(\cdot)$, then the iteration (2.1) reduces to the standard TD($\lambda$) in the case where $\gamma$ and $\lambda$ are constants.

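To explain at a high level what ETD($\lambda$) aims to achieve with the weighting scheme (2.2)-(2.4),
let us discuss the approximate Bellman equation it aims to solve. Associated with ETD($\lambda$) is a
generalized Bellman equation of which $v_\pi$ is the unique solution (Sutton, 1995):^8

$$v = r^\lambda_{\pi,\gamma} + P^\lambda_{\pi,\gamma}\, v. \tag{2.5}$$

Here $P^\lambda_{\pi,\gamma}$ is an $N \times N$ substochastic matrix, and $r^\lambda_{\pi,\gamma} \in \mathbb{R}^N$ is a vector of expected total rewards
attained by $\pi$ up to some random time depending on the functions $\gamma$ and $\lambda$, given by

$$P^\lambda_{\pi,\gamma} = I - (I - P_\pi \Gamma \Lambda)^{-1} (I - P_\pi \Gamma), \qquad r^\lambda_{\pi,\gamma} = (I - P_\pi \Gamma \Lambda)^{-1} r_\pi, \tag{2.6}$$

where $\Lambda$ is a diagonal matrix with diagonal entries $\lambda(s)$, $s \in \mathcal{S}$. ETD($\lambda$) aims to solve a projected
version of the Bellman equation (2.5) (see Sutton et al., 2015):

$$v = \Pi\big( r^\lambda_{\pi,\gamma} + P^\lambda_{\pi,\gamma}\, v \big), \quad v \in E, \qquad \Longleftrightarrow \qquad C\theta + b = 0, \quad \theta \in \mathbb{R}^n. \tag{2.7}$$

In the above, $\Pi$ is the projection onto $E$ with respect to a weighted Euclidean norm or seminorm.
The weights that define this norm also define the diagonal entries of a diagonal matrix $M$, and are
given by

$$\mathrm{diag}(M) = d_{\pi^o,i}^\top (I - P^\lambda_{\pi,\gamma})^{-1}, \quad \text{with } d_{\pi^o,i} \in \mathbb{R}^N, \ d_{\pi^o,i}(s) = d_{\pi^o}(s) \cdot i(s), \ s \in \mathcal{S}, \tag{2.8}$$

where $d_{\pi^o}(s) > 0$ denotes the steady state probability of state $s$ for the behavior policy $\pi^o$, under
Assumption 2.1(ii). For the corresponding linear equation in the $\theta$-space in Eq. (2.7),

$$C = -\Phi^\top M (I - P^\lambda_{\pi,\gamma})\, \Phi, \qquad b = \Phi^\top M\, r^\lambda_{\pi,\gamma}. \tag{2.9}$$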
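As a sanity check on the definitions in Eq. (2.6), consider the special case (assumed here only for illustration) of constant functions $\gamma(s) \equiv \gamma \in [0, 1)$ and $\lambda(s) \equiv \lambda \in [0, 1]$, so that $\Gamma = \gamma I$ and $\Lambda = \lambda I$. A short calculation then recovers the familiar geometric mixture of multi-step transition matrices:

```latex
\begin{align*}
P^\lambda_{\pi,\gamma}
  &= I - (I - \gamma\lambda P_\pi)^{-1}(I - \gamma P_\pi) \\
  &= (I - \gamma\lambda P_\pi)^{-1}\bigl[(I - \gamma\lambda P_\pi) - (I - \gamma P_\pi)\bigr] \\
  &= (1-\lambda)\,\gamma\,(I - \gamma\lambda P_\pi)^{-1} P_\pi
   = (1-\lambda)\sum_{k=0}^{\infty} \lambda^{k}\,\gamma^{k+1}\,P_\pi^{k+1}.
\end{align*}
```

In this special case, $\lambda = 0$ gives $P^\lambda_{\pi,\gamma} = \gamma P_\pi$ and $r^\lambda_{\pi,\gamma} = r_\pi$, the one-step Bellman equation, while $\lambda = 1$ gives $P^\lambda_{\pi,\gamma} = 0$ and $r^\lambda_{\pi,\gamma} = (I - \gamma P_\pi)^{-1} r_\pi = v_\pi$. The row sums equal $(1-\lambda)\gamma/(1-\gamma\lambda) \le 1$, consistent with the matrix being substochastic.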

Important for the convergence of ETD($\lambda$) is the negative definiteness of $C$. It can be shown that
under Assumption 2.1, $C$ is negative definite whenever $C$ is nonsingular (Prop. C.2, Appendix C).^9
By comparison, if we set $M_t = 1$ regardless of $F_t$ and $i(\cdot)$, the weights that define the projection
norm and $\mathrm{diag}(M)$ would simply become $d_{\pi^o}$, the same as in the regular off-policy TD($\lambda$). If we set
$M_t = i(S_t)$, then the weights are given by $d_{\pi^o,i}$. Neither of these cases guarantees $C$ to be negative
definite, unless $\lambda$ is sufficiently close to 1. Having the desirable negative definiteness property of $C$
is one of the motivations for introducing the weighting scheme (2.2)-(2.4) in ETD($\lambda$) (Sutton et al.,
2015).

7. For insights about ETD($\lambda$), see (Sutton et al., 2015; Mahmood et al., 2015). Our definition (2.4) of $\{e_t\}$ differs
slightly from its original definition, but the two are equivalent; ours appears to be more convenient for our analysis.

8. For the details of this Bellman equation, we refer the readers to the early work (Sutton, 1995; Sutton and Barto, 1998)
and the recent work (Sutton et al., 2015).

9. Prior to our work, Sutton et al. (2015) proved the negative definiteness of $C$ for positive $i(\cdot)$ under Assumption 2.1.

For the convergence analysis in this paper, we shall assume:

Assumption 2.2 (Nonsingularity condition) The matrix $C$ given in Eq. (2.9) is nonsingular.

We remark that for ETD($\lambda$) under Assumption 2.1, $C$ is always negative semidefinite (Sutton
et al., 2015) (cf. our Prop. C.1, Appendix C), and the nonsingularity condition above is indeed
equivalent to $C$ being negative definite (Prop. C.2). This condition is fairly mild and allows $i(s) = 0$
for some states $s$. Specifically, as we prove in Appendix C (see Prop. C.2), Assumption 2.2
is equivalent to a condition on the approximation subspace, which requires merely that the set
of feature vectors of those states with positive emphasis weights contains $n$ linearly independent
vectors (cf. Remark C.2). Moreover, this requirement can be fulfilled easily without knowledge of
the model (see Cor. C.1, Remark C.2). We also note that when $C$ is negative definite, the projection
$\Pi$ in Eq. (2.7) is well-defined (with respect to a seminorm if in Eq. (2.8) some diagonal entries
of $M$ equal zero), the projected Bellman equation (2.7) has a unique solution, and bounds on the
approximation error of ETD($\lambda$) can be derived using the approach of Scherrer (2010). (For details
of this discussion, see Remark C.1 in Appendix C.)

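The ELSTD($\lambda$) algorithm aims to solve the same projected Bellman equation (2.7) as ETD($\lambda$).
ELSTD($\lambda$) calculates iteratively an $n \times n$ matrix $C_t$ and a vector $b_t \in \mathbb{R}^n$ according to

$$C_{t+1} = (1 - \alpha_t)\, C_t + \alpha_t\, e_t \cdot \rho_t \big( \gamma_{t+1}\, \phi(S_{t+1}) - \phi(S_t) \big)^\top, \tag{2.10}$$

$$b_{t+1} = (1 - \alpha_t)\, b_t + \alpha_t\, e_t \cdot \rho_t R_t, \tag{2.11}$$

where the trace $e_t$ is calculated according to Eqs. (2.2)-(2.4) as in ETD($\lambda$). ELSTD($\lambda$) sets
$\theta_t = -C_t^{-1} b_t$, the solution to $C_t \theta + b_t = 0$, when $C_t$ is invertible.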
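The updates (2.10)-(2.11) can be transcribed directly. The sketch below assumes list-based vectors and matrices and, for brevity, solves $C\theta + b = 0$ only for $n = 2$ by the explicit $2 \times 2$ inverse; the function names and the test data are hypothetical, and the trace $e_t$ is taken as given.

```python
def elstd_update(C, b, e, rho, reward, s, s_next, phi, gamma, alpha):
    """Eqs. (2.10)-(2.11): one update of the matrix and vector iterates."""
    n = len(e)
    # d = gamma(s') * phi(s') - phi(s)
    d = [gamma[s_next] * x_next - x for x_next, x in zip(phi[s_next], phi[s])]
    C_new = [[(1.0 - alpha) * C[i][j] + alpha * e[i] * rho * d[j]
              for j in range(n)] for i in range(n)]
    b_new = [(1.0 - alpha) * b[i] + alpha * e[i] * rho * reward
             for i in range(n)]
    return C_new, b_new


def solve_theta_2x2(C, b):
    """Return theta = -C^{-1} b for a 2x2 matrix C, or None if C is singular."""
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    if det == 0.0:
        return None
    return [-(C[1][1] * b[0] - C[0][1] * b[1]) / det,
            -(C[0][0] * b[1] - C[1][0] * b[0]) / det]


# Hypothetical data: one update from zero iterates with stepsize 1.
phi = [[1.0, 0.0], [0.0, 1.0]]
gamma = [0.9, 0.5]
zero = [[0.0, 0.0], [0.0, 0.0]]
C1, b1 = elstd_update(zero, [0.0, 0.0], [1.0, 2.0], 1.0, 3.0, 0, 1, phi, gamma, 1.0)

theta = solve_theta_2x2([[-1.0, 0.0], [0.0, -2.0]], [1.0, 4.0])
```

After a single update $C_1$ is a rank-one matrix and hence singular, which illustrates why $\theta_t$ is defined only when $C_t$ is invertible.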

Like ETD($\lambda$), without the weighting scheme (2.2)-(2.4), ELSTD($\lambda$) would reduce essentially to
the regular LSTD($\lambda$) (see e.g., (Boyan, 1999; Yu, 2012) for on-policy and off-policy LSTD($\lambda$)).

2.3. Convergence Results 

We analyze ETD($\lambda$) and ELSTD($\lambda$) with diminishing stepsizes. Summarized below are their
convergence properties, which we will establish in the rest of this paper. In what follows, we denote
by $\|\cdot\|$ the infinity norm for both vectors and matrices (viewed as vectors). For different stepsize
conditions, our results will involve different convergence modes: convergence in $L^1$,^10 in
probability, or almost sure (a.s.) convergence (we write $\overset{a.s.}{\to}$ for “converges almost surely”). First, we state a
general stepsize condition that we will use.

Assumption 2.3 (Stepsize condition) The stepsize sequence $\{\alpha_t\}$ is deterministic and eventually
nonincreasing, and satisfies $\alpha_t \in (0, 1]$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$.

Under the above condition we may take $\alpha_t = t^{-c}$, $c \in (1/2, 1]$. However, stepsizes decreasing
as $t^{-1}$ will be required in our almost sure convergence results; some cases will require $\alpha_t = O(1/t)$
with $(\alpha_t - \alpha_{t+1})/\alpha_t = O(1/t)$.^11 (For instance, $\alpha_t = c_1/(c_2 + t)$ for some constants $c_1, c_2 > 0$.)
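The example stepsize $\alpha_t = c_1/(c_2 + t)$ can be checked numerically against the two $O(1/t)$ requirements. The sketch below uses the hypothetical choice $c_1 = c_2 = 1$ (so that $\alpha_t = 1/(t+1) \in (0, 1]$) and a finite horizon; it is a numerical illustration, not a proof.

```python
def stepsize(t, c1=1.0, c2=1.0):
    """alpha_t = c1 / (c2 + t); 0 < c1 <= c2 keeps alpha_t in (0, 1]."""
    return c1 / (c2 + t)


def relative_decrement(t, c1=1.0, c2=1.0):
    """(alpha_t - alpha_{t+1}) / alpha_t, which equals 1 / (c2 + t + 1) here."""
    a_t, a_next = stepsize(t, c1, c2), stepsize(t + 1, c1, c2)
    return (a_t - a_next) / a_t


# Largest values of t * alpha_t and t * (relative decrement) over a finite horizon;
# both staying bounded is what the O(1/t) conditions ask for.
max_scaled_step = max(t * stepsize(t) for t in range(1, 10000))
max_scaled_decrement = max(t * relative_decrement(t) for t in range(1, 10000))
```

Algebraically, $(\alpha_t - \alpha_{t+1})/\alpha_t = 1 - (c_2 + t)/(c_2 + t + 1) = 1/(c_2 + t + 1)$, so both scaled quantities stay below 1 for this choice of constants.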

10. For vector-valued random variables $X, X_t, t \ge 0$, by “$\{X_t\}$ converges to $X$ in $L^1$” we mean $\mathbb{E}[\|X_t - X\|] \to 0$ as $t \to \infty$.

11. We write $\delta_t = O(1/t)$ for a scalar sequence $\{\delta_t\}$, if for some $c > 0$, $0 \le \delta_t \le c/t$ for all $t$.

Our results are as follows. Let $\theta^*$ denote the desired limit for ETD($\lambda$):

$$\theta^* = -C^{-1} b, \quad \text{for } C, b \text{ defined by Eq. (2.9) under Assumptions 2.1, 2.2.}$$

Theorem 2.1 ($L^1$ and almost sure convergence of ELSTD($\lambda$) iterates)
Under Assumptions 2.1, 2.3, for any given initial $(e_0, F_0, C_0, b_0)$, the sequence $\{(C_t, b_t)\}$ generated
by the ELSTD($\lambda$) algorithm (2.10)-(2.11) converges in $L^1$:

$$\lim_{t \to \infty} \mathbb{E}\big[\|C_t - C\|\big] = 0, \qquad \lim_{t \to \infty} \mathbb{E}\big[\|b_t - b\|\big] = 0.$$

If in addition the stepsize is given by $\alpha_t = 1/(t + 1)$, then $C_t \overset{a.s.}{\to} C$, $b_t \overset{a.s.}{\to} b$.

The preceding theorem yields immediately the convergence of the parameter sequence $\{\theta_t\}$
generated by ELSTD($\lambda$):

Corollary 2.1 (Convergence of ELSTD($\lambda$)) Let Assumptions 2.1-2.3 hold. Let $\{\theta_t\}$ be generated
by the ELSTD($\lambda$) algorithm (2.10)-(2.11) as $\theta_t = -C_t^{-1} b_t$. Then for any given initial $(e_0, F_0, C_0, b_0)$,
$\{\theta_t\}$ converges to $\theta^*$ in probability; if in addition $\alpha_t = 1/(t + 1)$, then $\theta_t \overset{a.s.}{\to} \theta^*$.

Theorem 2.2 (Almost sure convergence of ETD($\lambda$)) Let Assumptions 2.1-2.3 hold. Let $\{\theta_t\}$ be
generated by the ETD($\lambda$) algorithm (2.1) with stepsizes satisfying $\alpha_t = O(1/t)$ and
$(\alpha_t - \alpha_{t+1})/\alpha_t = O(1/t)$. Then for any given initial $(e_0, F_0, \theta_0)$, $\theta_t \overset{a.s.}{\to} \theta^*$.

Remark 2.1 (On stepsizes) We believe that the range of stepsizes for the a.s. convergence of
ELSTD($\lambda$) can be enlarged. If additional conditions on the behavior policy are imposed to
restrict the variances of the trace iterates, it should also be possible to enlarge the range of stepsizes
for ETD($\lambda$). These topics, as well as the use of random stepsizes, are under active investigation.

Remark 2.2 (On variances) The preceding convergence results hold under almost minimal
conditions on the behavior policy (Assumption 2.1(ii)). However, unless we restrict sufficiently the
behavior policy (which is difficult to do without knowledge of the model, when $\gamma \approx 1$), the
variances of the trace iterates can grow unboundedly (cf. Remark A.1), significantly affecting the speed
of convergence. This is a main difficulty in off-policy methods in general. Further research is
required to overcome it. For a recent work in this direction, see (Mahmood et al., 2014).

3. Properties of Trace Iterates and Convergence Analysis of ELSTD($\lambda$)

In this section we analyze the trace iterates and convergence properties of ELSTD($\lambda$) iterates. The
analysis not only leads to Theorem 2.1 on the convergence of ELSTD($\lambda$), but also prepares the stage
for the subsequent ODE-based convergence proof for ETD($\lambda$), by ensuring that “local averaging”
gives the desired “mean dynamics,” as will be seen in Section 4.

The structure of our analysis will be similar to that of (Yu, 2012) for regular off-policy LSTD($\lambda$),
but the proofs at intermediate steps are new and more involved. We will explain the key proof
arguments in this section, and give the proof details and related results in Appendix A.

3.1. Properties of Trace Iterates

Let $Z_t = (S_t, A_t, e_t, F_t)$ for $t \ge 0$; they form a Markov chain on $\mathcal{S} \times \mathcal{A} \times \mathbb{R}^{n+1}$. First, we
observe several important properties of the trace iterates $\{(e_t, F_t)\}$ and the Markov chain $\{Z_t\}$,
under Assumption 2.1:

(i) For any given initial $(e_0, F_0)$, $\sup_{t \ge 0} \mathbb{E}[\|(e_t, F_t)\|] < \infty$. (See Prop. A.1.)

(ii) Let $\{(e_t, F_t)\}$ and $\{(\hat{e}_t, \hat{F}_t)\}$ be defined by the same recursion (2.2)-(2.4), using the same
state and action random variables, but with different initial conditions $(e_0, F_0) \neq (\hat{e}_0, \hat{F}_0)$.
Then $F_t - \hat{F}_t \overset{a.s.}{\to} 0$ and $e_t - \hat{e}_t \overset{a.s.}{\to} 0$ (the zero vector in $\mathbb{R}^n$). (See Prop. A.2.)

(iii) We can approximate the traces $(e_t, F_t)$, which depend on the entire history of past states and
actions, by similarly defined “truncated traces” $(\tilde{e}_{t,K}, \tilde{F}_{t,K})$ which depend on the most recent
$2K$ states and actions only [cf. Eqs. (A.13)-(A.15)]. The expected approximation “error”
can be bounded uniformly in $t$, by a constant $L_K$ which decreases to 0 as $K \to \infty$. (See
Prop. A.3.)

(iv) $\{Z_t\}$ is a weak Feller Markov chain^12 and bounded in probability,^13 and hence it has at least
one invariant probability measure.^14

Furthermore, as we will show in Theorem 3.2 below, $\{Z_t\}$ has a unique invariant probability
measure and is ergodic.
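To see what property (ii) above involves, one can subtract the recursions (2.2)-(2.4) for the two trace sequences, which are driven by the same states and actions. The $i(S_t)$ terms cancel, leaving a homogeneous recursion for the follow-on traces and a linear recursion for the eligibility traces. The display below only records this elementary calculation; it is not the proof of Prop. A.2.

```latex
\begin{align*}
F_t - \hat F_t
  &= \gamma_t\,\rho_{t-1}\,(F_{t-1} - \hat F_{t-1})
   = \Bigl(\prod_{k=1}^{t} \gamma_k\,\rho_{k-1}\Bigr)(F_0 - \hat F_0), \\
e_t - \hat e_t
  &= \lambda_t\,\gamma_t\,\rho_{t-1}\,(e_{t-1} - \hat e_{t-1})
     + (1-\lambda_t)\,(F_t - \hat F_t)\,\phi(S_t).
\end{align*}
```

Property (ii) therefore amounts to the statement that these random products of $\gamma_k \rho_{k-1}$ vanish almost surely, even though individual factors $\rho_{k-1}$ may exceed 1.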

These properties suggest that despite the growing variances, the trace iterates are well-behaved. 
Figure 1 shows how the convergence results of this section, to be introduced next, will depend on 
these properties. 

Figure 1: Diagrams showing dependence relations between the results in this paper. “A → B”
means A is used in proving B.


12. A Markov chain $\{X_t\}$ on a metric space is weak Feller if $\mathbb{E}[f(X_1) \mid X_0 = x]$ is continuous in $x$ for every bounded
continuous function $f$ on the state space (Meyn and Tweedie, 2009, Prop. 6.1.1(i)). Using this and the fact that
$(e_1, F_1)$ depends continuously on $(e_0, F_0)$ [cf. Eqs. (2.2)-(2.4)], the weak Feller property of $\{Z_t\}$ can be seen.

13. A Markov chain $\{X_t\}$ on a topological space is bounded in probability if, for each initial state $x$ and each $\epsilon > 0$,
there exists a compact subset $D$ of the state space such that $\liminf_{t \to \infty} \mathbf{P}_x(X_t \in D) \ge 1 - \epsilon$, where $\mathbf{P}_x$ denotes
the probability of events conditional on $X_0 = x$ (Meyn and Tweedie, 2009, p. 142). In our case, since $\mathcal{S}$ and $\mathcal{A}$ are
finite, the property (i) above together with the Markov inequality implies that $\{Z_t\}$ is bounded in probability (cf. Yu,
2012, Lemma 3.4).

14. By (Meyn and Tweedie, 2009, Theorem 12.1.2(ii)), a weak Feller Markov chain bounded in probability has at least
one invariant probability measure. We mention that there is also an alternative, direct proof of the existence of an
invariant probability measure for $\{Z_t\}$, which does not rely on the weak Feller property (see Appendix A.6).

3.2. Main Results on $L^1$ and Almost Sure Convergence

We formulate our convergence results in terms of a general recursion that can be specialized to
the ELSTD($\lambda$) iteration. This generality is needed in order to make the results useful for other
proofs, specifically, for proving the uniqueness of the invariant probability measure of $\{Z_t\}$, and
for establishing convergence conditions required by an ODE-based analysis for ETD($\lambda$), as those
proofs will rely on the convergence properties of certain iterates that are different from ELSTD($\lambda$).

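We define the general recursion just mentioned as follows. Denote $y = (e, F)$; thus $y \in \mathbb{R}^{n+1}$.
Consider a vector-valued function $h : \mathbb{R}^{n+1} \times \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^m$ such that $h(y, s, a, s')$ is Lipschitz
continuous in $y$ for each $(s, a, s')$; i.e., there exists some constant $L_h$ such that for any $y, \hat{y} \in \mathbb{R}^{n+1}$,

$$\|h(y, s, a, s') - h(\hat{y}, s, a, s')\| \le L_h \|y - \hat{y}\|, \qquad \forall\, (s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}. \tag{3.1}$$

Given $h$, $\{Z_t\}$ and the stepsizes $\{\alpha_t\}$, we define a recursion as follows, writing $Y_t = (e_t, F_t)$:

$$G_{t+1} = (1 - \alpha_t)\, G_t + \alpha_t\, h(Y_t, S_t, A_t, S_{t+1}). \tag{3.2}$$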
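For intuition about the recursion (3.2): with the stepsize $\alpha_t = 1/(t+1)$, the iterate $G_{t+1}$ is exactly the running average of the first $t+1$ values of $h$, and the initial $G_0$ is forgotten after the first step because $\alpha_0 = 1$. The scalar sketch below, with made-up values standing in for $h(Y_t, S_t, A_t, S_{t+1})$, illustrates this.

```python
def run_recursion(h_values, G0):
    """Apply G_{t+1} = (1 - a_t) G_t + a_t h_t with a_t = 1 / (t + 1)."""
    G = G0
    for t, h in enumerate(h_values):
        alpha = 1.0 / (t + 1)
        G = (1.0 - alpha) * G + alpha * h
    return G


# Hypothetical scalar samples; G0 = 100 is deliberately far from their mean.
samples = [3.0, 1.0, 2.0, 6.0]
G_final = run_recursion(samples, G0=100.0)
sample_mean = sum(samples) / len(samples)
```

This is why, for this stepsize, almost sure convergence of $\{G_t\}$ is a law-of-large-numbers type statement for the process $\{Z_t\}$.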

The ELSTD($\lambda$) iterates $C_t$ and $b_t$ correspond to the following choices of $h$, respectively:

$$h_1(y, s, a, s') = e \cdot \rho(s, a) \big( \gamma(s')\phi(s') - \phi(s) \big)^\top, \qquad h_2(y, s, a, s') = e \cdot \rho(s, a)\, r(s, a, s'). \tag{3.3}$$

Here $h_1$ is matrix-valued (we view it as an $\mathbb{R}^m$-valued function with $m = n \times n$), and $h_2$ is
$\mathbb{R}^n$-valued. As just mentioned, we will also need to consider other choices of $h$ in our proofs later.

We first show that $\{G_t\}$ converges in $L^1$ to some constant vector. The proof (given in
Appendix A.3) exploits the property (iii) of truncated traces mentioned earlier: this property allows us
to obtain the desired result by working with simple finite-state Markov chains.

Theorem 3.1 ($L^1$-convergence of $\{G_t\}$) Let $h$ be a vector-valued function satisfying the Lipschitz
condition (3.1), and let $\{G_t\}$ be defined by the recursion (3.2), using the process $\{Z_t\}$. Then under
Assumptions 2.1, 2.3, there exists a constant vector $G^*$ (independent of the stepsizes) such that for
any given initial $Y_0 = (e_0, F_0)$ and $G_0$, $\lim_{t \to \infty} \mathbb{E}[\|G_t - G^*\|] = 0$.

Next we analyze the a.s. convergence of $\{G_t\}$, by using ergodicity properties of the
infinite-space Markov chain $\{Z_t\}$ that we establish first. For each initial condition $Z_0 = z$, define the
occupation probability measures $\mu_{z,t}$ for $t \ge 1$, by $\mu_{z,t}(B) = \frac{1}{t} \sum_{k=1}^{t} \mathbb{1}_B(Z_k)$ for any Borel subset
$B$ of $\mathcal{S} \times \mathcal{A} \times \mathbb{R}^{n+1}$, where $\mathbb{1}_B$ denotes the indicator function for the set $B$ (i.e., $\mathbb{1}_B(x) = 1$ if $x \in B$,
and $\mathbb{1}_B(x) = 0$ otherwise). Let $\mathbb{E}_\mu$ denote expectation with respect to the probability distribution of
the process $\{Z_t\}$ with $\mu$ as the initial distribution of $Z_0$.

Theorem 3.2 (Ergodicity of $\{Z_t\}$) Under Assumption 2.1, the Markov chain $\{Z_t\}$ has a unique
invariant probability measure $\zeta$, and moreover, the following hold:

(i) For each initial condition $Z_0 = z$, the sequence $\{\mu_{z,t}\}$ of occupation measures converges
weakly^15 to $\zeta$, almost surely.

(ii) $\mathbb{E}_\zeta\big[\|h(Y_0, S_0, A_0, S_1)\|\big] < \infty$ for any function $h$ satisfying the Lipschitz condition (3.1).

15. For probability measures $\mu, \mu_t, t \ge 0$, on a metric space $X$, $\{\mu_t\}$ is said to converge weakly to $\mu$ if for all bounded
continuous functions $f$ on $X$, $\int f \, d\mu_t \to \int f \, d\mu$ as $t \to \infty$ (Dudley, 2002, Chap. 9.3).

The preceding theorem follows from the properties of trace iterates given earlier and
Theorem 3.1 (cf. Figure 1). The proof is the same as the corresponding proofs of (Yu, 2012, Theorem
3.2 and Prop. 3.2) for the case of off-policy LSTD. In particular, to prove the uniqueness of the
invariant probability measure (which is not as easy to prove as the existence given in the property
(iv) earlier), we use the property (ii) and the convergence in $L^1$ result given in Theorem 3.1.^16

We can now show that $\{G_t\}$ converges a.s. for stepsize $\alpha_t = 1/(t + 1)$, by using the preceding
results (cf. Figure 1), together with a strong law of large numbers for stationary processes (Doob,
1953, Chap. X, Theorem 2.1) (see also Meyn and Tweedie, 2009, Theorem 17.1.2). The proof is a
verbatim repetition of the proof of (Yu, 2012, Theorem 3.3) and is therefore omitted.

Theorem 3.3 (Almost sure convergence of $\{G_t\}$) Let $h$ and $\{G_t\}$ be as in Theorem 3.1, and let
the stepsize be $\alpha_t = 1/(t + 1)$. Then, under Assumption 2.1, for any given initial $Y_0 = (e_0, F_0)$ and
$G_0$, $G_t \overset{a.s.}{\to} G^*$, where $G^* = \mathbb{E}_\zeta[h(Y_0, S_0, A_0, S_1)]$ is the constant vector in Theorem 3.1.

Finally, we also need to analyze the cumulative effects of noise in the observed rewards $R_t$ and
show that they diminish asymptotically. To this end, consider the following recursion: $W_0 = 0$ and

$$W_{t+1} = (1 - \alpha_t)\, W_t + \alpha_t\, e_t\, \rho_t \cdot \omega_{t+1}, \qquad t \ge 0, \tag{3.4}$$

where $\omega_{t+1} = R_t - r(S_t, A_t, S_{t+1})$ are noise variables.

Proposition 3.1 (Effects of noise in random rewards) Under Assumptions 2.1, 2.3, for any given
initial $(e_0, F_0)$, we have (i) $\mathbb{E}[\|W_t\|] \to 0$; and (ii) if, in addition, the stepsize is $\alpha_t = 1/(t + 1)$,
then $W_t \overset{a.s.}{\to} 0$.

The proof of the preceding proposition is given in Appendix A.4. The proof of part (i) uses the 
property (iii) of truncated traces, similarly to the proof of Theorem 3.1, and the proof of part (ii) is 
similar to that of Theorem 3.3 (cf. Figure 1). 

The convergence of ELSTD($\lambda$) stated in Theorem 2.1 now follows from the preceding results
(cf. Figure 1). Specifically, we calculate the limit $G^*$ in Theorem 3.1 for the two functions $h_1, h_2$ in
Eq. (3.3), which are associated with the ELSTD($\lambda$) iterates $\{C_t\}$, $\{b_t\}$, respectively, and we show
that $G^* = C$ for $h = h_1$ and $G^* = b$ for $h = h_2$. We also write the iterates $\{b_t\}$ equivalently
as $b_{t+1} = G_{t+1} + W_{t+1}$ with $h = h_2$ in the definition of $\{G_t\}$. Then, the $L^1$-convergence part
of Theorem 2.1 follows from Theorem 3.1 and Prop. 3.1(i), and the a.s. convergence part of
Theorem 2.1 follows from Theorem 3.3 and Prop. 3.1(ii). The complete proof with all the details is given
in Appendix A.5.

4. Convergence Analysis of ETD($\lambda$)

Recall that ETD($\lambda$) calculates iteratively $\theta_t$, $t \ge 0$, according to

$$\theta_{t+1} = \theta_t + \alpha_t\, e_t \cdot \rho_t \big( R_t + \gamma_{t+1}\, \phi(S_{t+1})^\top \theta_t - \phi(S_t)^\top \theta_t \big). \tag{4.1}$$

Using the results of Section 3, we can now analyze its convergence by applying a “mean ODE”
method from stochastic approximation theory (Kushner and Yin, 2003).

16. Theorem 3.1 is useful here because on the separable metric space $\mathcal{S} \times \mathcal{A} \times \mathbb{R}^{n+1}$, bounded Lipschitz continuous 
functions are convergence-determining for weak convergence of probability measures (Dudley, 2002, Theorem 11.3.3). 

Denoting $\tilde{\omega}_{t+1} = \rho_t \big(R_t - r(S_t, A_t, S_{t+1})\big)$, let us write the iteration (4.1) equivalently as 

$$\theta_{t+1} = \theta_t + \alpha_t \, h(\theta_t, \xi_t) + \alpha_t \, e_t \cdot \tilde{\omega}_{t+1}, \tag{4.2}$$

where $\xi_t = (e_t, S_t, A_t, S_{t+1})$ and $h : \mathbb{R}^n \times \mathbb{R}^n \times \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^n$ is given by 

$$h(\theta, \xi) = e \cdot \rho(s,a) \big( r(s,a,s') + \gamma(s') \phi(s')^\top \theta - \phi(s)^\top \theta \big), \quad \text{for } \xi = (e, s, a, s'). \tag{4.3}$$

We will apply (Kushner and Yin, 2003, Theorem 6.1.1) to analyze the convergence of $\{\theta_t\}$ generated 
by (4.1). The “mean ODE” associated with ETD(λ) (4.1) is 

$$\dot{x} = \bar{h}(x), \quad \text{where } \bar{h}(x) = Cx + b. \tag{4.4}$$

When $C$ is negative definite, the above ODE has a unique bounded (constant) solution $x(\cdot) \equiv \theta^* = 
-C^{-1} b$ on the time interval $(-\infty, +\infty)$, and $\theta^*$ is globally asymptotically stable for (4.4) in the 
sense of Liapunov (cf. Kushner and Clark, 1978, p. 23-24). (A Liapunov function in this case is 
given by $\|\theta - \theta^*\|_2^2$, where $\|\cdot\|_2$ denotes the Euclidean norm.) 
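
To see why this Liapunov function works, one can differentiate it along solutions of (4.4). The short derivation below is only a sketch: it assumes that negative definiteness of $C$ is understood as $x^\top C x \le -c \|x\|_2^2$ for some $c > 0$ and all $x$, and the symbol $V$ is introduced here for illustration. Since $C\theta^* = -b$, we have $\bar{h}(x) = C(x - \theta^*)$, and therefore

$$V(x) = \|x - \theta^*\|_2^2, \qquad \frac{d}{dt} V\big(x(t)\big) = 2\,(x - \theta^*)^\top C\,(x - \theta^*) \le -2c\, V\big(x(t)\big),$$

so that $V(x(t)) \le e^{-2ct}\, V(x(0))$ and every solution converges to $\theta^*$ exponentially fast.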

However, the a.s. boundedness of $\{\theta_t\}$ is not easy to prove directly, which has prevented us from 
getting the desired convergence $\theta_t \overset{a.s.}{\to} \theta^*$ from (Kushner and Yin, 2003, Theorem 6.1.1) directly. 
For this reason, we analyze first a constrained version of (4.1) and establish its convergence. The 
result will then help the convergence analysis of the unconstrained algorithm (4.1) in Section 4.2. 

4.1. Convergence of Constrained ETD(λ) 

Consider the following constrained ETD(λ) algorithm: 

$$\theta_{t+1} = \Pi_B \big( \theta_t + \alpha_t \, h(\theta_t, \xi_t) + \alpha_t \, e_t \cdot \tilde{\omega}_{t+1} \big), \tag{4.5}$$

where $B$ is a closed ball in $\mathbb{R}^n$ with a sufficiently large radius $r$: $B = \{\theta \in \mathbb{R}^n \mid \|\theta\|_2 \le r\}$, 
and $\Pi_B$ is the Euclidean projection onto $B$. The “mean ODE” associated with the constrained 
algorithm (4.5) is the projected ODE 

$$\dot{x} = \bar{h}(x) + z, \quad z \in -\mathcal{N}_B(x), \tag{4.6}$$

where $\mathcal{N}_B(x)$ is the normal cone of $B$ at $x$, and $z$ is the boundary reflection term that cancels out the 
component of $\bar{h}(x)$ in $\mathcal{N}_B(x)$ and is the “minimal force” needed to keep the solution in $B$ (Kushner 
and Yin, 2003, Chap. 4.3). The negative definiteness of the matrix $C$ implies that the projected ODE 
(4.6) has no stationary points other than $\theta^*$ if the radius of $B$ is sufficiently large: 

Lemma 4.1 Let $c > 0$ be such that $x^\top C x \le -c \|x\|_2^2$ for all $x \in \mathbb{R}^n$. Suppose $B$ has a radius 
$r > \|b\|_2 / c$. Then $\theta^*$ lies in the interior of $B$, and the only solution $x(t)$, $t \in (-\infty, +\infty)$, of the 
projected ODE (4.6) in $B$ is $x(\cdot) \equiv \theta^*$. 
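
The first claim of the lemma can be checked directly from the definitions; the following lines are a sketch of that step only, not the argument of Appendix B. Using $C\theta^* = -b$ together with the negative definiteness condition and the Cauchy-Schwarz inequality,

$$c\, \|\theta^*\|_2^2 \le -\,{\theta^*}^\top C\, \theta^* = {\theta^*}^\top b \le \|\theta^*\|_2 \, \|b\|_2, \qquad \text{hence} \qquad \|\theta^*\|_2 \le \|b\|_2 / c < r,$$

which places $\theta^*$ strictly inside the ball $B$.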

The proof of Lemma 4.1 is given in Appendix B. We now apply (Kushner and Yin, 2003, 
Theorem 6.1.1) and Lemma 4.1 to prove the a.s. convergence of the constrained ETD(λ) as stated in 
the theorem below. The proof is given in Appendix B, and it uses the results of Section 3 to verify 
the conditions required by (Kushner and Yin, 2003, Theorem 6.1.1). 

Theorem 4.1 (Almost sure convergence of constrained ETD(λ)) Let Assumptions 2.1-2.3 hold. 
Let $\{\theta_t\}$ be the sequence generated by the constrained ETD(λ) algorithm (4.5) with stepsizes 
satisfying $\alpha_t = O(1/t)$ and $(\alpha_t - \alpha_{t+1})/\alpha_t = O(1/t)$, and with the radius $r$ of $B$ exceeding the threshold 
given in Lemma 4.1. Then, for any given initial $(e_0, F_0, \theta_0)$, $\theta_t \overset{a.s.}{\to} \theta^*$. 

4.2. Convergence of ETD(λ) 

We now prove the convergence theorem, Theorem 2.2, for the unconstrained ETD(λ) algorithm by 
using the convergence of the constrained algorithm we just established. In particular, we shall 
compare the iterates generated by the unconstrained algorithm with those generated by the constrained 
one, and show that the difference between them diminishes asymptotically with probability one. 

Let $B = \{\theta \in \mathbb{R}^n \mid \|\theta\|_2 \le r\}$ with its radius $r$ satisfying the condition of Lemma 4.1. Note 
that to project $\theta$ onto $B$ is simply to scale $\theta$: $\Pi_B \theta = \theta$ if $\|\theta\|_2 \le r$; and $\Pi_B \theta = r \cdot \theta / \|\theta\|_2$ if 
$\|\theta\|_2 > r$. More concisely, 

$$\Pi_B \theta = \eta\, \theta, \quad \text{where } \eta = \min\{1, \, r/\|\theta\|_2\}.$$
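
The scaling form of the projection is simple to implement. The following Python sketch (the function name `project_onto_ball` is hypothetical) computes the Euclidean projection onto a centered ball of radius `radius` exactly as in the formula above, leaving points inside the ball unchanged.

```python
import math
from typing import List


def project_onto_ball(theta: List[float], radius: float) -> List[float]:
    """Euclidean projection of theta onto {x : ||x||_2 <= radius}.

    Implements Pi_B(theta) = eta * theta with eta = min(1, radius / ||theta||_2).
    """
    norm = math.sqrt(sum(x * x for x in theta))
    if norm <= radius:
        # Already inside the ball (this also covers theta = 0): eta = 1.
        return list(theta)
    eta = radius / norm
    return [eta * x for x in theta]
```

For example, projecting the vector (3, 4), whose norm is 5, onto the unit ball rescales it by 1/5.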

To simplify notation, define matrix $H_t$ and vector $g_t$ by 

$$H_t = e_t \, \rho_t \big( \gamma_{t+1} \phi(S_{t+1}) - \phi(S_t) \big)^\top, \qquad g_t = e_t \cdot \rho_t R_t.$$

Let us write the constrained algorithm (4.5) equivalently as 

$$\hat{\theta}_{t+1} = (I + \alpha_t H_t) \cdot \eta_t \hat{\theta}_t + \alpha_t g_t, \tag{4.7}$$

where $\eta_0 = 1$ and $\eta_t = \min\{1, \, r/\|\hat{\theta}_t\|_2\}$ for $t \ge 1$. (For $t \ge 1$, $\eta_t \hat{\theta}_t$ corresponds to the projected 
iterate in (4.5), and $\hat{\theta}_t$ the iterate just before the projection.) The unconstrained algorithm (4.1) can 
be equivalently written as 

$$\theta_{t+1} = (I + \alpha_t H_t) \cdot \theta_t + \alpha_t g_t. \tag{4.8}$$

Lemma 4.2 Under the conditions of Theorem 4.1, for any given initial $(e_0, F_0)$, almost surely, the 
sequence of matrices, $\prod_{k=\bar{t}}^{t} (I + \alpha_k H_k)$, $t = \bar{t}, \bar{t}+1, \ldots$, converges to the $n \times n$ zero matrix as 
$t \to \infty$, for all $\bar{t} \ge 0$. 

Proof It is sufficient to consider a given (arbitrary) vector $y \in \mathbb{R}^n$ and prove that for each initial 
$(e_0, F_0)$ and each $\bar{t} \ge 0$, $\prod_{k=\bar{t}}^{t} (I + \alpha_k H_k)\, y \overset{a.s.}{\to} 0$. To this end, consider generating the iterates 
$\hat{\theta}_{\bar{t}}, \hat{\theta}_{\bar{t}+1}, \ldots$, starting from time $\bar{t}$ and $\hat{\theta}_{\bar{t}} = y$, by using the constrained algorithm (4.7) as follows: 

$$\hat{\theta}_{k+1} = (I + \alpha_k H_k) \cdot \eta_k \hat{\theta}_k, \quad k \ge \bar{t}.$$

In the above, we calculate $(e_k, F_k)$ and $H_k$ as before starting from time 0 and the given initial 
condition $(e_0, F_0)$, and we have set $g_k = R_k = 0$ for all $k$. Notice that since the stepsize sequence 
$\{\alpha_t\}$ satisfies the condition of Theorem 4.1, so does the stepsize sequence $\alpha_{\bar{t}+1}, \alpha_{\bar{t}+2}, \ldots$. Then, 
in view of the Markovian property of $\{(S_t, A_t, e_t, F_t)\}$, we can apply Theorem 4.1 to the above 
iteration starting from time $\bar{t}$ for each possible value of $(e_{\bar{t}}, F_{\bar{t}})$, thereby concluding that for the 
given $(e_0, F_0)$ and $\bar{t}$, $\hat{\theta}_t \overset{a.s.}{\to} 0$ (because $R_k = 0$ for all $k$ and the solution to $C\theta = 0$ is 0). 

On the other hand, 

$$\hat{\theta}_{t+1} = \Big( \prod_{k=\bar{t}}^{t} \eta_k \Big) \cdot \Big( \prod_{k=\bar{t}}^{t} (I + \alpha_k H_k) \Big) \cdot y. \tag{4.9}$$

Since the solution 0 lies in the interior of $B$, if $\hat{\theta}_t \to 0$, then $\eta_k = 1$ for all $k$ sufficiently large. 
Thus the convergence $\hat{\theta}_t \overset{a.s.}{\to} 0$ implies that as $t \to \infty$, $\prod_{k=\bar{t}}^{t} \eta_k$ converges a.s. to a strictly positive 
number that depends on the sample path and the vector $y$. Consequently, from Eq. (4.9) and the 
convergence $\hat{\theta}_t \overset{a.s.}{\to} 0$, we obtain that $\big( \prod_{k=\bar{t}}^{t} (I + \alpha_k H_k) \big)\, y \overset{a.s.}{\to} 0$ as $t \to \infty$. Now this holds for 
any given vector $y$, so by letting $y$ be each column of the identity matrix, it follows that as $t \to \infty$, 
the matrix $\prod_{k=\bar{t}}^{t} (I + \alpha_k H_k)$ converges a.s. to the zero matrix. ■ 
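
To build intuition for the statement of Lemma 4.2, it can help to look at a deterministic scalar stand-in. The Python sketch below is not the setting of the lemma (there, the $H_k$ are random matrices): it assumes a hypothetical constant $H_k \equiv -1$ and stepsizes $\alpha_k = 1/(k+2)$, for which the product from any starting index $\bar{t}$ telescopes to $(\bar{t}+1)/(t+2)$ and hence vanishes as $t \to \infty$.

```python
def step_size(k: int) -> float:
    """Illustrative stepsize alpha_k = 1 / (k + 2), which is O(1/k)."""
    return 1.0 / (k + 2)


def product_from(start: int, end: int, h: float = -1.0) -> float:
    """Scalar analogue of prod_{k=start}^{end} (1 + alpha_k * h).

    With h = -1 each factor is (k + 1) / (k + 2), so the product
    telescopes to (start + 1) / (end + 2).
    """
    prod = 1.0
    for k in range(start, end + 1):
        prod *= 1.0 + step_size(k) * h
    return prod
```

The same decay holds for every starting index, which mirrors the "for all $\bar{t} \ge 0$" part of the lemma.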


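Finally, we prove the a.s. convergence of the unconstrained ETD(λ) as stated by Theorem 2.2: 

Proof of Theorem 2.2 Let $\{\hat{\theta}_t\}$ be the iterates generated by the constrained algorithm (4.7) using 
the same trajectory of states, actions and rewards that are used by the unconstrained algorithm (4.1) 
to generate $\{\theta_t\}$. By Theorem 4.1 and Lemma 4.2, there exists a set $\Omega_1$ of sample paths such that 
$\Omega_1$ has probability one and on $\Omega_1$, 

$$\hat{\theta}_t \to \theta^* \quad \text{and} \quad \lim_{t \to \infty} \prod_{k=\bar{t}}^{t} (I + \alpha_k H_k) = 0_{n \times n}, \quad \forall\, \bar{t} \ge 0,$$

where $0_{n \times n}$ denotes the $n \times n$ zero matrix. Consider each path in $\Omega_1$. By our choice of the constraint 
set $B$, $\theta^*$ lies in the interior of $B$ (Lemma 4.1), so the convergence $\hat{\theta}_t \to \theta^*$ implies the existence of 
a path-dependent time $t' < \infty$ such that $\eta_k = 1$ for all $k \ge t'$. Then 

$$\hat{\theta}_{k+1} = (I + \alpha_k H_k) \cdot \hat{\theta}_k + \alpha_k g_k, \quad \forall\, k \ge t',$$

and consequently, 

$$\hat{\theta}_{k+1} - \theta_{k+1} = (I + \alpha_k H_k) \cdot (\hat{\theta}_k - \theta_k), \quad \forall\, k \ge t',$$

$$\hat{\theta}_{t+1} - \theta_{t+1} = \Big( \prod_{k=t'}^{t} (I + \alpha_k H_k) \Big) \cdot (\hat{\theta}_{t'} - \theta_{t'}), \quad \forall\, t \ge t'. \tag{4.10}$$

As $t \to \infty$, the matrix $\prod_{k=t'}^{t} (I + \alpha_k H_k) \to 0_{n \times n}$ for the sample path under consideration. Thus, 
from Eq. (4.10) we obtain $\hat{\theta}_t - \theta_t \to 0$; since $\hat{\theta}_t \to \theta^*$, this implies $\theta_t \to \theta^*$. ■ 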


Remark 4.1 (Almost sure convergence of regular off-policy TD(λ)) If $\lambda$ is a constant sufficiently 
close to 1, the matrix associated with the “mean updates” of the regular off-policy TD(λ) algorithm 
is also negative definite (Bertsekas and Yu, 2009). In that case, (Yu, 2012, Prop. 4.1) established 
the a.s. convergence but only for a constrained version of the algorithm, similar to our Theorem 4.1. 
The proofs given in this subsection, combined with (Yu, 2012, Prop. 4.1), can be used to establish 
the desired a.s. convergence for the unconstrained off-policy TD(λ) in that case. 

Appendix A. Proof Details for Section 3 

In this appendix we give proof details and related results for Section 3. Assumption 2.1 on the 
target and behavior policies will be in force throughout, so it will not be mentioned explicitly in 
intermediate technical results. 

A.1. Some Basic Technical Lemmas 

We prove three basic lemmas that will be useful later. First, recall that $\Gamma$ and $\Lambda$ are diagonal matrices 
with $\gamma(s)$ (discount factors) and $\lambda(s)$, $s \in \mathcal{S}$, on their diagonals, respectively. Note also that under 
Assumption 2.1, the inverse $(I - P_\pi \Gamma)^{-1}$ exists. This implies that $(I - P_\pi \Gamma \Lambda)^{-1}$ also exists. Then, 
since 

$$(I - P_\pi \Gamma)^{-1} = \sum_{t=0}^{\infty} (P_\pi \Gamma)^t, \qquad (I - P_\pi \Gamma \Lambda)^{-1} = \sum_{t=0}^{\infty} (P_\pi \Gamma \Lambda)^t,$$

both $(P_\pi \Gamma)^t$ and $(P_\pi \Gamma \Lambda)^t$ converge to the zero matrix as $t \to \infty$. 

We now specify some notation. In what follows, let $\mathbf{1}$ denote the vector of all ones. For an 
expression $H$ that results in a vector in $\mathbb{R}^{|\mathcal{S}|}$, we will write $(H)(s)$ for the $s$-th entry of the resulting 
vector. (For example, $(P_\pi \mathbf{1})(s)$ and $(\mathbf{1}^\top P_\pi)(s)$ represent the $s$-th entry of the vector $P_\pi \mathbf{1}$ and $\mathbf{1}^\top P_\pi$, 
respectively.) 

Let $\mathcal{F}_t = \sigma(S_0, A_0, \ldots, S_t)$ be the $\sigma$-algebra generated by the states and actions up to time 
$t$, including the state $S_t$ but excluding the action $A_t$. Recall some shorthand notation we defined 
earlier: 

$$\rho_t = \rho(S_t, A_t) = \frac{\pi(A_t \mid S_t)}{\pi^o(A_t \mid S_t)}, \qquad \gamma_t = \gamma(S_t), \qquad \lambda_t = \lambda(S_t).$$

To simplify notation, let us also define for $t \ge 1$, 

$$\beta_t = \rho_{t-1} \gamma_t \lambda_t.$$


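Lemma A.1 For all $t > k \ge 0$, 

$$\mathbb{E}\big[ \rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t \mid \mathcal{F}_k \big] = \big( (P_\pi \Gamma)^{t-k} \mathbf{1} \big)(S_k) \le 1, \tag{A.1}$$

$$\mathbb{E}\big[ \beta_{k+1} \beta_{k+2} \cdots \beta_t \mid \mathcal{F}_k \big] = \big( (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1} \big)(S_k) \le 1. \tag{A.2}$$

Furthermore, as $t \to \infty$, $\prod_{k=1}^{t} (\rho_{k-1} \gamma_k) \overset{a.s.}{\to} 0$ and $\prod_{k=1}^{t} \beta_k \overset{a.s.}{\to} 0$. 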

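Proof The first two equations follow simply from a direct calculation. 

Let $\Delta_t = \prod_{k=1}^{t} (\rho_{k-1} \gamma_k)$. To prove $\Delta_t \overset{a.s.}{\to} 0$, consider equivalently the iterates 

$$\Delta_0 = 1, \qquad \Delta_t = (\rho_{t-1} \gamma_t)\, \Delta_{t-1}, \quad t \ge 1.$$

Clearly $\Delta_t$ is $\mathcal{F}_t$-measurable, and by Eq. (A.1) with $k = t - 1$, 

$$\mathbb{E}[\Delta_t \mid \mathcal{F}_{t-1}] = \Delta_{t-1} \cdot \mathbb{E}[\rho_{t-1} \gamma_t \mid \mathcal{F}_{t-1}] \le \Delta_{t-1}.$$

So $\{(\Delta_t, \mathcal{F}_t)\}$ is a nonnegative supermartingale with $\mathbb{E}[\Delta_0] = 1 < \infty$. By a convergence theorem 
for nonnegative supermartingales (Neveu, 1975, Theorem II-2-9), $\Delta_t \overset{a.s.}{\to} \Delta_\infty$ for some nonnegative 
random variable $\Delta_\infty$ satisfying $\mathbb{E}[\Delta_\infty] \le \liminf_{t \to \infty} \mathbb{E}[\Delta_t]$. From Eq. (A.1) with $k = 0$, we have 
$\mathbb{E}[\Delta_t] \le \mathbf{1}^\top (P_\pi \Gamma)^t \mathbf{1} \to 0$ as $t \to \infty$; therefore, $\mathbb{E}[\Delta_\infty] = 0$. This implies $\Delta_\infty = 0$ a.s., i.e., 
$\Delta_t \overset{a.s.}{\to} 0$. 

The assertion $\prod_{k=1}^{t} \beta_k \overset{a.s.}{\to} 0$ follows similarly by considering the iterates $\Delta_t = \beta_t \Delta_{t-1}$ with 
$\Delta_0 = 1$, and by using Eq. (A.2) together with the nonnegative supermartingale convergence 
argument. ■ 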
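
The deterministic fact that drives this proof, namely that the entries of $(P_\pi \Gamma)^t \mathbf{1}$ are at most 1 and tend to 0, can be checked numerically on a small example. The Python sketch below uses a hypothetical two-state chain chosen here for illustration (it is not taken from the paper); one of the two states has discount factor 1, and the powers still vanish because the other state is discounted.

```python
from typing import List

Matrix = List[List[float]]


def mat_vec(m: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product for small dense matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def discounted_transition(p: Matrix, gammas: List[float]) -> Matrix:
    """Form P * Gamma, i.e. entry (s, s') equals P[s][s'] * gamma(s')."""
    return [[p_ss * g for p_ss, g in zip(row, gammas)] for row in p]


def expected_discount_products(p: Matrix, gammas: List[float],
                               steps: int) -> List[List[float]]:
    """Return the vectors (P Gamma)^t 1 for t = 0, 1, ..., steps."""
    pg = discounted_transition(p, gammas)
    v = [1.0] * len(p)
    history = [v]
    for _ in range(steps):
        v = mat_vec(pg, v)
        history.append(v)
    return history


# Hypothetical example: target-policy transition matrix and discounts.
P_PI = [[0.5, 0.5], [0.2, 0.8]]
GAMMAS = [0.9, 1.0]
HISTORY = expected_discount_products(P_PI, GAMMAS, 2000)
```

By (A.1), the $s$-th entry of the $t$-th vector in `HISTORY` is the conditional expectation of the product of $t$ consecutive factors $\rho_k \gamma_{k+1}$ given that the chain starts in state $s$.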


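Lemma A.2 For $k \ge 0$, let $Y_k$ be an $\mathcal{F}_k$-measurable nonnegative random variable. Then for $t > k$, 

$$\mathbb{E}\big[ Y_k \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t) \big] \le \mathbb{E}[Y_k] \cdot \mathbf{1}^\top (P_\pi \Gamma)^{t-k} \mathbf{1}, \tag{A.3}$$

$$\mathbb{E}\big[ Y_k \cdot (\beta_{k+1} \beta_{k+2} \cdots \beta_t) \big] \le \mathbb{E}[Y_k] \cdot \mathbf{1}^\top (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1}. \tag{A.4}$$

Hence, if for some constant $L < \infty$, $\mathbb{E}[Y_k] \le L$ for all $k$, then 

$$\mathbb{E}\Big[ \sum_{k=0}^{t} Y_k \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t) \Big] \le L \cdot \mathbf{1}^\top \Big( \sum_{k=0}^{t} (P_\pi \Gamma)^k \Big) \mathbf{1} < \infty, \tag{A.5}$$

$$\mathbb{E}\Big[ \sum_{k=0}^{t} Y_k \cdot (\beta_{k+1} \beta_{k+2} \cdots \beta_t) \Big] \le L \cdot \mathbf{1}^\top \Big( \sum_{k=0}^{t} (P_\pi \Gamma \Lambda)^k \Big) \mathbf{1} < \infty. \tag{A.6}$$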


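Proof For any $k \ge 0$, $\big( (P_\pi \Gamma)^{t-k} \mathbf{1} \big)(S_k) \le \mathbf{1}^\top (P_\pi \Gamma)^{t-k} \mathbf{1}$. Using this, Eq. (A.1) in Lemma A.1, 
and the assumption that $Y_k$ is $\mathcal{F}_k$-measurable, we have 

$$\mathbb{E}\big[ Y_k \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t) \big] = \mathbb{E}\Big[ Y_k \cdot \mathbb{E}\big[ (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t) \mid \mathcal{F}_k \big] \Big] \le \mathbb{E}[Y_k] \cdot \mathbf{1}^\top (P_\pi \Gamma)^{t-k} \mathbf{1}.$$

This proves Eqs. (A.3), (A.5). Similarly, Eqs. (A.4), (A.6) are obtained by using Eq. (A.2) in 
Lemma A.1 and a direct calculation. ■ 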


Lemma A.3 Let $\{a_k\}$ and $\{c_k\}$ be two sequences of nonnegative numbers with $\sum_{k \ge 0} a_k < \infty$ and 
$\sum_{k \ge 0} c_k < \infty$. Then $\lim_{t \to \infty} \sum_{k=1}^{t} a_k c_{t-k} = 0$. 


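Proof For any $m < t$, 

$$\sum_{k=1}^{t} a_k c_{t-k} = \sum_{k=1}^{m} a_k c_{t-k} + \sum_{k=m+1}^{t} a_k c_{t-k} \le \big( \max_k a_k \big) \cdot \sum_{k=t-m}^{t-1} c_k + \big( \max_k c_k \big) \cdot \sum_{k=m+1}^{t} a_k.$$

Since $\{a_k\}$ and $\{c_k\}$ are summable by assumption, we have $\max_k a_k < \infty$, $\max_k c_k < \infty$, and 
$\lim_{t \to \infty} \sum_{k=t-m}^{t-1} c_k = 0$. So if we fix $m$ and let $t$ go to infinity in the preceding inequality, we have 

$$\limsup_{t \to \infty} \sum_{k=1}^{t} a_k c_{t-k} \le \big( \max_k c_k \big) \cdot \sum_{k=m+1}^{\infty} a_k.$$

Since $\lim_{m \to \infty} \sum_{k=m+1}^{\infty} a_k = 0$ by the summability assumption, by letting $m$ go to infinity in the 
right-hand side above, we obtain $\lim_{t \to \infty} \sum_{k=1}^{t} a_k c_{t-k} = 0$. ■ 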
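
A quick numerical check of Lemma A.3 can be made with two geometric sequences. In the Python sketch below, the sequences $a_k = 0.5^k$ and $c_k = 0.8^k$ are illustrative choices made here (in the application of the lemma they are replaced by $\mathbf{1}^\top (P_\pi \Gamma)^k \mathbf{1}$ and $\mathbf{1}^\top (P_\pi \Gamma \Lambda)^k \mathbf{1}$); the function name is hypothetical.

```python
from typing import Callable


def convolution_term(a: Callable[[int], float],
                     c: Callable[[int], float], t: int) -> float:
    """Compute sum_{k=1}^{t} a_k * c_{t-k}, the quantity in Lemma A.3."""
    return sum(a(k) * c(t - k) for k in range(1, t + 1))


def a_seq(k: int) -> float:
    """Illustrative summable sequence a_k = 0.5^k."""
    return 0.5 ** k


def c_seq(k: int) -> float:
    """Illustrative summable sequence c_k = 0.8^k."""
    return 0.8 ** k
```

For these sequences the sum equals $0.8^t \sum_{k=1}^{t} 0.625^k$, which is bounded by a constant multiple of $0.8^t$ and therefore vanishes as $t$ grows.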
A.2. Properties of the Trace Iterates $\{(e_t, F_t)\}$ 

In this subsection we state formally the properties of trace iterates which we mentioned in 
Section 3.1, and we give their proofs. These properties will be used frequently in obtaining some of our 
main convergence theorems. 

The following proposition is the property (i) mentioned in Section 3.1. First, let us express the 
traces $e_t$, $F_t$, by using their definitions [cf. Eqs. (2.2)-(2.4)], as 

$$F_t = F_0 \cdot (\rho_0 \gamma_1 \cdots \rho_{t-1} \gamma_t) + \sum_{k=1}^{t} i(S_k) \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t), \tag{A.7}$$

$$e_t = e_0 \cdot (\beta_1 \cdots \beta_t) + \sum_{k=1}^{t} M_k \cdot \phi(S_k) \cdot (\beta_{k+1} \cdots \beta_t), \tag{A.8}$$

where $\beta_k = \rho_{k-1} \gamma_k \lambda_k$ as defined in the previous subsection, and 

$$M_k = \lambda_k \, i(S_k) + (1 - \lambda_k)\, F_k.$$

Let $\mathcal{F}_t = \sigma(S_0, A_0, \ldots, S_t)$ for $t \ge 0$, throughout this subsection. 

Proposition A.1 For any given initial $(e_0, F_0)$, $\sup_{t \ge 0} \mathbb{E}\big[ \|(e_t, F_t)\| \big] < \infty$. 

Proof Let us calculate $\mathbb{E}[F_t]$ and $\mathbb{E}[\|e_t\|]$. Since the number of states is finite, there exists a finite 
constant $L > 0$ such that $L \ge \|(e_0, F_0)\|$ and $L \ge i(s)$, $L \ge \|\phi(s)\|$ for all states $s$. Using the 
expression (A.7) for $F_t$ and applying Eq. (A.5) in Lemma A.2 (with $Y_0 = F_0$, $Y_k = i(S_k)$, $k \ge 1$), 
we have the bound 

$$\mathbb{E}[F_t] \le L \cdot \mathbf{1}^\top \Big( \sum_{k=0}^{t} (P_\pi \Gamma)^k \Big) \mathbf{1} \le L \cdot \mathbf{1}^\top (I - P_\pi \Gamma)^{-1} \mathbf{1}.$$

Thus $\sup_{t \ge 0} \mathbb{E}[F_t] < \infty$. We now calculate $\mathbb{E}[\|e_t\|]$. Using the expression (A.8) for $e_t$, and using 
also the fact $M_k \le L + F_k$, we can bound $\|e_t\|$ by 

$$\|e_t\| \le L \cdot (\beta_1 \cdots \beta_t) + L \cdot \sum_{k=1}^{t} (L + F_k) \cdot (\beta_{k+1} \beta_{k+2} \cdots \beta_t).$$

Using the fact $\sup_{k \ge 0} \mathbb{E}[F_k] \le L'$ for some finite constant $L'$ as we just proved, and using also 
Eq. (A.6) in Lemma A.2 (with $Y_0 = L$, $Y_k = L(L + F_k)$, $k \ge 1$), we obtain 

$$\mathbb{E}[\|e_t\|] \le L(L + L' + 1) \cdot \mathbf{1}^\top \Big( \sum_{k=0}^{t} (P_\pi \Gamma \Lambda)^k \Big) \mathbf{1} \le L(L + L' + 1) \cdot \mathbf{1}^\top (I - P_\pi \Gamma \Lambda)^{-1} \mathbf{1}.$$

Hence $\sup_{t \ge 0} \mathbb{E}[\|e_t\|] < \infty$. Since $\|(e_t, F_t)\| \le \|e_t\| + F_t$, this shows that 

$$\sup_{t \ge 0} \mathbb{E}\big[ \|(e_t, F_t)\| \big] \le \sup_{t \ge 0} \mathbb{E}[\|e_t\|] + \sup_{t \ge 0} \mathbb{E}[F_t] < \infty.$$

The proof is complete. ■ 

Recall that $\{Z_t\}$ with $Z_t = (S_t, A_t, e_t, F_t)$ denotes the Markov chain on the joint space $\mathcal{S} \times 
\mathcal{A} \times \mathbb{R}^{n+1}$ of states, actions and traces, and it is a weak Feller Markov chain (cf. Footnote 12). As 
explained in Section 3.1, since $\mathcal{S}$ and $\mathcal{A}$ are finite, the preceding proposition implies that $\{Z_t\}$ is 
bounded in probability and hence, by its weak Feller property, has at least one invariant probability 
measure. We will need the following result (which is the property (ii) in Section 3.1) to prove that 
$\{Z_t\}$ has a unique invariant probability measure. 

Let $(\hat{e}_t, \hat{F}_t)$, $t \ge 1$, be defined by the same recursion (2.2)-(2.4) that defines $(e_t, F_t)$, using the 
same state and action random variables, but with a different initial condition $(\hat{e}_0, \hat{F}_0)$. We write a 
zero vector in any Euclidean space as 0. 

Proposition A.2 For any two given initial conditions $(e_0, F_0)$ and $(\hat{e}_0, \hat{F}_0)$, 

$$F_t - \hat{F}_t \overset{a.s.}{\to} 0, \qquad e_t - \hat{e}_t \overset{a.s.}{\to} 0.$$

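Proof Using the expression (A.7) for $F_t$ and $\hat{F}_t$, we have $F_t - \hat{F}_t = (F_0 - \hat{F}_0) \cdot \prod_{k=1}^{t} (\rho_{k-1} \gamma_k)$. 
Since $\prod_{k=1}^{t} (\rho_{k-1} \gamma_k) \overset{a.s.}{\to} 0$ by Lemma A.1, it follows that $F_t - \hat{F}_t \overset{a.s.}{\to} 0$. 

The proof of Lemma A.1 also shows 

$$\mathbb{E}\big[ |F_t - \hat{F}_t| \big] \le |F_0 - \hat{F}_0| \cdot \mathbf{1}^\top (P_\pi \Gamma)^t \mathbf{1}. \tag{A.9}$$

We will need this inequality below. 

We now prove $e_t - \hat{e}_t \overset{a.s.}{\to} 0$. To simplify the derivation, we first observe that if $e_0 \ne \hat{e}_0$ but $F_0 = 
\hat{F}_0$, then $F_t = \hat{F}_t$ for all $t$, so the expression (A.8) for $e_t$ and $\hat{e}_t$ gives $\|e_t - \hat{e}_t\| = \|e_0 - \hat{e}_0\| \cdot \prod_{k=1}^{t} \beta_k$. 
Since $\prod_{k=1}^{t} \beta_k \overset{a.s.}{\to} 0$ by Lemma A.1, it follows immediately that in this case $\|e_t - \hat{e}_t\| \overset{a.s.}{\to} 0$. 

Thus, for the general case, we can focus on the difference between $e_t$ and $\hat{e}_t$ that is due to 
the difference between the initial $F_0$ and $\hat{F}_0$. In particular, define another sequence of iterates 
$(\tilde{e}_t, \tilde{F}_t)$ using the same recursion (2.2)-(2.4) but with the initial condition $(\tilde{e}_0, \tilde{F}_0) = (e_0, \hat{F}_0)$. Since 
$\|e_t - \hat{e}_t\| \le \|e_t - \tilde{e}_t\| + \|\tilde{e}_t - \hat{e}_t\|$ and $\|\tilde{e}_t - \hat{e}_t\| \overset{a.s.}{\to} 0$ by what we just proved, to show $e_t - \hat{e}_t \overset{a.s.}{\to} 0$, 
it is sufficient to prove $\|e_t - \tilde{e}_t\| \overset{a.s.}{\to} 0$. 

Since $\tilde{F}_0 = \hat{F}_0$, the sequence $\{\tilde{F}_t\}$ coincides with $\{\hat{F}_t\}$. Then by the definition of $e_t$ and $\tilde{e}_t$, 

$$e_t - \tilde{e}_t = \beta_t \, (e_{t-1} - \tilde{e}_{t-1}) + (1 - \lambda_t)\, (F_t - \hat{F}_t) \cdot \phi(S_t).$$

Since $0 \le \mathbb{E}[\beta_t \mid \mathcal{F}_{t-1}] \le 1$ (Lemma A.1) and $0 \le 1 - \lambda_t \le 1$, it follows that 

$$\mathbb{E}\big[ \|e_t - \tilde{e}_t\| \mid \mathcal{F}_{t-1} \big] \le \|e_{t-1} - \tilde{e}_{t-1}\| + Y_{t-1}, \quad \text{where } Y_{t-1} = \mathbb{E}\big[ |F_t - \hat{F}_t| \cdot \|\phi(S_t)\| \mid \mathcal{F}_{t-1} \big]. \tag{A.10}$$

Let us show $\sum_{t=0}^{\infty} Y_t < \infty$ a.s. In view of Eq. (A.10), this will then imply, by a convergence theorem 
in (Neveu, 1975, Ex. II-4, p. 33-34) for nonnegative random processes (which is a consequence of 
the nonnegative supermartingale convergence theorem), that $\|e_t - \tilde{e}_t\|$ converges a.s. to a finite limit. 

To prove $\sum_{t=0}^{\infty} Y_t < \infty$ a.s., it is sufficient to show $\mathbb{E}\big[ \sum_{t=0}^{\infty} Y_t \big] < \infty$. Let $L = \max_{s \in \mathcal{S}} \|\phi(s)\|$. 
By Eq. (A.9), for each $t$, 

$$\mathbb{E}[Y_t] = \mathbb{E}\big[ |F_{t+1} - \hat{F}_{t+1}| \cdot \|\phi(S_{t+1})\| \big] \le L\, |F_0 - \hat{F}_0| \cdot \mathbf{1}^\top (P_\pi \Gamma)^{t+1} \mathbf{1}.$$

Since $\sum_{t=0}^{\infty} \big( \mathbf{1}^\top (P_\pi \Gamma)^{t+1} \mathbf{1} \big) \le \mathbf{1}^\top (I - P_\pi \Gamma)^{-1} \mathbf{1} < \infty$, we obtain $\mathbb{E}\big[ \sum_{t=0}^{\infty} Y_t \big] = \sum_{t=0}^{\infty} \mathbb{E}[Y_t] < 
\infty$. Hence $\sum_{t=0}^{\infty} Y_t < \infty$ a.s., and as discussed earlier, this implies that 

$$\|e_t - \tilde{e}_t\| \overset{a.s.}{\to} \Delta_\infty$$

for a nonnegative real-valued random variable $\Delta_\infty$. 

What remains to be proved is $\Delta_\infty = 0$ a.s. By Fatou's lemma (Dudley, 2002, Theorem 4.3.3), 

$$\mathbb{E}[\Delta_\infty] \le \liminf_{t \to \infty} \mathbb{E}\big[ \|e_t - \tilde{e}_t\| \big]. \tag{A.11}$$

We show that the right-hand side equals 0. By a direct calculation (using Eq. (A.8) and the fact 
$\tilde{e}_0 = e_0$), we can write 

$$e_t - \tilde{e}_t = \sum_{k=1}^{t} \phi(S_k) \cdot (F_k - \hat{F}_k) \cdot (1 - \lambda_k) \cdot (\beta_{k+1} \cdots \beta_t).$$

For each $k \ge 1$, using Lemma A.2 and Eq. (A.9), we have 

$$\mathbb{E}\big[ |F_k - \hat{F}_k| \cdot (1 - \lambda_k) \cdot (\beta_{k+1} \cdots \beta_t) \big] \le \mathbb{E}\big[ |F_k - \hat{F}_k| \big] \cdot \mathbf{1}^\top (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1} \le |F_0 - \hat{F}_0| \cdot \big( \mathbf{1}^\top (P_\pi \Gamma)^k \mathbf{1} \big) \cdot \big( \mathbf{1}^\top (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1} \big).$$

From the preceding two relations, it follows that 

$$\mathbb{E}\big[ \|e_t - \tilde{e}_t\| \big] \le L\, |F_0 - \hat{F}_0| \cdot \sum_{k=1}^{t} \big( \mathbf{1}^\top (P_\pi \Gamma)^k \mathbf{1} \big) \cdot \big( \mathbf{1}^\top (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1} \big). \tag{A.12}$$

From Lemma A.3 with $\{a_k\}$ and $\{c_k\}$ defined as $a_k = \mathbf{1}^\top (P_\pi \Gamma)^k \mathbf{1}$ and $c_k = \mathbf{1}^\top (P_\pi \Gamma \Lambda)^k \mathbf{1}$ for 
$k \ge 0$, we have 

$$\lim_{t \to \infty} \sum_{k=1}^{t} \big( \mathbf{1}^\top (P_\pi \Gamma)^k \mathbf{1} \big) \cdot \big( \mathbf{1}^\top (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1} \big) = 0.$$

Combining this with Eq. (A.12) gives $\liminf_{t \to \infty} \mathbb{E}\big[ \|e_t - \tilde{e}_t\| \big] = 0$, and consequently, $\mathbb{E}[\Delta_\infty] = 0$ 
by Eq. (A.11). This implies $\Delta_\infty = 0$ a.s., i.e., $\|e_t - \tilde{e}_t\| \overset{a.s.}{\to} 0$. ■ 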

The next proposition is the property (iii) mentioned in Section 3.1, which concerns approximating 
the trace iterates $(e_t, F_t)$ by truncated traces that depend on a fixed number of the most recent 
states and actions only. We will use this proposition subsequently to prove Theorem 3.1: it allows us 
to work with simple finite-space Markov chains, instead of working with the infinite-space Markov 
chain $\{Z_t\}$ directly. 

For each integer $K \ge 1$, we define the truncated traces $Y_{t,K} = (e_{t,K}, F_{t,K})$ as follows: 

$$Y_{t,K} = (e_t, F_t) \quad \text{for } t \le K,$$

and for $t \ge K + 1$, 

$$F_{t,K} = \sum_{k=t-K}^{t} i(S_k) \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t), \tag{A.13}$$

$$M_{t,K} = \lambda_t \, i(S_t) + (1 - \lambda_t)\, F_{t,K}, \tag{A.14}$$

$$e_{t,K} = \sum_{k=t-K}^{t} M_{k,K} \cdot \phi(S_k) \cdot (\beta_{k+1} \cdots \beta_t). \tag{A.15}$$
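
The effect of truncation on the follow-on trace can be seen in a small deterministic computation. The Python sketch below compares $F_t$, computed by the recursion $F_k = \gamma_k \rho_{k-1} F_{k-1} + i(S_k)$, with the truncated sum in (A.13). The function names and the input sequences used in the test of this sketch are hypothetical; lists are indexed by time, so `rho[k]` stands for $\rho_k$, `gamma[k]` for $\gamma_k$ and `interest[k]` for $i(S_k)$.

```python
from typing import Sequence


def discount_product(rho: Sequence[float], gamma: Sequence[float],
                     k: int, t: int) -> float:
    """rho_k*gamma_{k+1} * ... * rho_{t-1}*gamma_t (equal to 1 when k == t)."""
    prod = 1.0
    for j in range(k, t):
        prod *= rho[j] * gamma[j + 1]
    return prod


def full_follow_on(F0: float, rho: Sequence[float], gamma: Sequence[float],
                   interest: Sequence[float], t: int) -> float:
    """F_t from the recursion F_k = gamma_k * rho_{k-1} * F_{k-1} + i(S_k)."""
    F = F0
    for k in range(1, t + 1):
        F = gamma[k] * rho[k - 1] * F + interest[k]
    return F


def truncated_follow_on(F0: float, rho: Sequence[float],
                        gamma: Sequence[float], interest: Sequence[float],
                        t: int, K: int) -> float:
    """F_{t,K}: equal to F_t for t <= K, and to the sum in (A.13) otherwise."""
    if t <= K:
        return full_follow_on(F0, rho, gamma, interest, t)
    return sum(interest[k] * discount_product(rho, gamma, k, t)
               for k in range(t - K, t + 1))
```

With all ratios equal to 1, a constant discount of 0.5 and unit interest, the gap $F_t - F_{t,K}$ consists of the dropped terms only and shrinks geometrically as $K$ grows, which is the behavior that the next proposition quantifies in expectation.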
Denote the original traces by $Y_t = (e_t, F_t)$; recall that they can be expressed as in Eqs. (A.7)-
(A.8). We have the following result, in which the notation “$L_K \downarrow 0$” means that $L_K$ decreases 
monotonically to 0 as $K \to \infty$, and in which $Z_0 = (S_0, A_0, e_0, F_0)$ as we recall: 

Proposition A.3 

(i) For any given initial $Y_0 = (e_0, F_0)$, there exist constants $L_K$, $K \ge 1$, with $L_K \downarrow 0$, such that 

$$\mathbb{E}\big[ \|Y_t - Y_{t,K}\| \big] \le L_K, \quad \forall\, t \ge 0.$$

(ii) There exist constants $L_K$, $K \ge 1$, independent of the given initial value of $Z_0$, such that 
$L_K \downarrow 0$ and 

$$\mathbb{E}\big[ \|Y_{t,K'} - Y_{t,K}\| \big] \le L_K, \quad \forall\, K' \ge K, \ t > 2K'.$$


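Proof Let $L = \max\{F_0, \max_{s \in \mathcal{S}} i(s)\}$. We first calculate $F_t - F_{t,K}$. By definition $F_t - F_{t,K} = 0$ 
for $t \le K$. For $t \ge K + 1$, using the expressions (A.7), (A.13) of $F_t$ and $F_{t,K}$, we have 

$$F_t - F_{t,K} = F_0 \cdot (\rho_0 \gamma_1 \cdots \rho_{t-1} \gamma_t) + \sum_{k=1}^{t-K-1} i(S_k) \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t),$$

from which it follows by applying Eq. (A.3) in Lemma A.2 that 

$$\mathbb{E}\big[ |F_t - F_{t,K}| \big] \le L \cdot \mathbf{1}^\top \Big( \sum_{k=K+1}^{t} (P_\pi \Gamma)^k \Big) \mathbf{1} \le L \cdot \mathbf{1}^\top \Big( \sum_{k=K+1}^{\infty} (P_\pi \Gamma)^k \Big) \mathbf{1} \overset{\text{def}}{=} L_K^{(1)}. \tag{A.16}$$

Similarly we bound $e_t - e_{t,K}$. By definition $e_t = e_{t,K}$ for $t \le K$. For $t \ge K + 1$, using the 
expressions (A.8), (A.15) of $e_t$ and $e_{t,K}$, and using also the expressions (2.3), (A.14) of $M_t$ and 
$M_{t,K}$, we have 

$$e_t - e_{t,K} = e_0 \cdot (\beta_1 \cdots \beta_t) + \sum_{k=1}^{t-K-1} M_k \cdot \phi(S_k) \cdot (\beta_{k+1} \cdots \beta_t) + \sum_{k=t-K}^{t} \phi(S_k) \cdot (F_k - F_{k,K}) \cdot (1 - \lambda_k) \cdot (\beta_{k+1} \cdots \beta_t).$$

By Prop. A.1, $\sup_{k \ge 0} \mathbb{E}[M_k] < \infty$, so we can find a constant $L' < \infty$ that is greater than $\|e_0\|$, 
$\max_{s \in \mathcal{S}} \|\phi(s)\|$ and $(\max_{s \in \mathcal{S}} \|\phi(s)\|) \cdot \sup_{k \ge 0} \mathbb{E}[M_k]$. Then applying Eq. (A.4) in Lemma A.2, 
and using also Eq. (A.16), we obtain 

$$\begin{aligned}
\mathbb{E}\big[ \|e_t - e_{t,K}\| \big] &\le L' \cdot \mathbf{1}^\top \Big( \sum_{k=0}^{t-K-1} (P_\pi \Gamma \Lambda)^{t-k} \Big) \mathbf{1} + L' \cdot \sum_{k=t-K}^{t} \mathbb{E}\big[ |F_k - F_{k,K}| \big] \cdot \big( \mathbf{1}^\top (P_\pi \Gamma \Lambda)^{t-k} \mathbf{1} \big) \\
&\le L' \cdot \mathbf{1}^\top \Big( \sum_{k=K+1}^{t} (P_\pi \Gamma \Lambda)^k \Big) \mathbf{1} + L' \cdot L_K^{(1)} \cdot \mathbf{1}^\top \Big( \sum_{k=0}^{K} (P_\pi \Gamma \Lambda)^k \Big) \mathbf{1} \\
&\le L' \cdot \mathbf{1}^\top \Big( \sum_{k=K+1}^{\infty} (P_\pi \Gamma \Lambda)^k \Big) \mathbf{1} + L' \cdot L_K^{(1)} \cdot \big( \mathbf{1}^\top (I - P_\pi \Gamma \Lambda)^{-1} \mathbf{1} \big) \overset{\text{def}}{=} L_K^{(2)}.
\end{aligned}$$

Now let $L_K = L_K^{(1)} + L_K^{(2)}$. From the expressions of $L_K^{(1)}$ and $L_K^{(2)}$ given above, clearly, $L_K \downarrow 0$ 
as $K \to \infty$. Then, in view of the relation $\|Y_t - Y_{t,K}\| \le |F_t - F_{t,K}| + \|e_t - e_{t,K}\|$, we obtain the 
desired bound $\mathbb{E}\big[ \|Y_t - Y_{t,K}\| \big] \le \mathbb{E}\big[ |F_t - F_{t,K}| \big] + \mathbb{E}\big[ \|e_t - e_{t,K}\| \big] \le L_K$. This proves part (i). 

Part (ii) is proved similarly. By the definition of the truncated traces, for $t > 2K' \ge 2K$, 

$$F_{t,K'} - F_{t,K} = \sum_{k=t-K'}^{t-K-1} i(S_k) \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t),$$

and 

$$e_{t,K'} - e_{t,K} = \sum_{k=t-K'}^{t-K-1} M_{k,K'} \cdot \phi(S_k) \cdot (\beta_{k+1} \cdots \beta_t) + \sum_{k=t-K}^{t} \phi(S_k) \cdot (F_{k,K'} - F_{k,K}) \cdot (1 - \lambda_k) \cdot (\beta_{k+1} \cdots \beta_t).$$

We then apply the same calculation as in the proof of part (i). When $t > 2K'$, the truncated traces 
do not depend on the initial condition $(e_0, F_0)$. Since the state and action spaces are finite, we can 
set the constants $L, L'$ to be independent of the initial condition of $Z_0$. Part (ii) then follows. ■ 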


Remark A.1 (On the behavior of trace iterates) From the properties of $\{(e_t, F_t)\}$ given above 
and the ergodicity of the Markov chain $\{(S_t, A_t, e_t, F_t)\}$ shown in Theorem 3.2, we see that these 
trace iterates are well-behaved. On the other hand, like in regular off-policy algorithms, these 
iterates can be unbounded almost surely and their variances can grow to infinity with time. There 
are no contradictions here. To illustrate this point, let us consider a simple example with just 1 state 
and 2 actions, $\mathcal{S} = \{1\}$, $\mathcal{A} = \{a_1, a_2\}$, where all actions result in a self-transition at state 1. Let 
$\pi(a_1 \mid 1) = 1$ for the target policy $\pi$, and let $\pi^o(a_1 \mid 1) = q < 1$ for the behavior policy $\pi^o$. Let the 
discount factor be a constant $\gamma < 1$. Then for all $t$, 

$$\mathbb{E}\big[ \gamma_t^2 \rho_{t-1}^2 \mid \mathcal{F}_{t-1} \big] = \gamma^2 / q.$$

Suppose $\gamma^2 / q > 1$. Then even with $i(1) = 0$, if $F_0 > 0$, the definition $F_t = \gamma_t \rho_{t-1} F_{t-1}$ implies 
that 

$$\mathbb{E}[F_t^2] = \mathbb{E}\Big[ \mathbb{E}\big[ \gamma_t^2 \rho_{t-1}^2 \mid \mathcal{F}_{t-1} \big] \cdot F_{t-1}^2 \Big] = (\gamma^2 / q)^t \cdot F_0^2 \to \infty,$$

yet since $i(1) = 0$, $\{F_t\}$ is also a supermartingale converging to 0 a.s. (cf. the proof of Lemma A.1). 
For the case $i(1) > 0$, again $\mathbb{E}[F_t^2] \to \infty$ if $\gamma^2 / q > 1$, and by (Yu, 2012, Prop. 3.1) the sequence 
$\{F_t\}$ is almost surely unbounded if $\gamma / q > 1$, yet $\{F_t\}$ is bounded in probability in the sense 
described by Prop. A.1. 

As mentioned earlier in Remark 2.2, it can be desirable to restrict the behavior policy so that 
the variances of the trace iterates do not grow to infinity. In the simple example above, this can be 
easily arranged. In the general case, however, if the state-dependent discount factor $\gamma(\cdot)$ can take 
the value 1 for some states, then without knowledge of the MDP model, to sufficiently restrict the 
behavior policy seems to be a difficult task. 
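
The one-state example in this remark can be worked out in closed form, which makes the coexistence of exploding second moments and almost sure convergence concrete. The Python sketch below covers the case $i(1) = 0$ and rests on the following reading of the example: at each step $\rho_{t-1}$ equals $1/q$ with probability $q$ and 0 otherwise, independently over time, so that $\mathbb{E}[F_t] = \gamma^t F_0$, $\mathbb{E}[F_t^2] = (\gamma^2/q)^t F_0^2$, and $F_t$ is nonzero with probability $q^t$. The function name and the numerical values are illustrative choices.

```python
from typing import Tuple


def trace_moments(gamma: float, q: float, F0: float,
                  t: int) -> Tuple[float, float, float]:
    """Exact quantities for F_t = gamma * rho_{t-1} * F_{t-1} with i(1) = 0.

    Returns (E[F_t], E[F_t^2], P(F_t != 0)), where rho takes the value
    1/q with probability q and 0 with probability 1 - q.
    """
    mean = (gamma ** t) * F0
    second_moment = ((gamma ** 2 / q) ** t) * F0 ** 2
    prob_nonzero = q ** t
    return mean, second_moment, prob_nonzero


# Illustrative values with gamma^2 / q = 1.62 > 1.
MEAN_50, SECOND_50, PROB_50 = trace_moments(gamma=0.9, q=0.5, F0=1.0, t=50)
```

With these values the mean and the probability of a nonzero trace both decay geometrically, while the second moment grows geometrically: the large variance comes from increasingly rare but increasingly large values of $F_t$.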
A.3. Proof of Theorem 3.1 

For convenience, we restate Theorem 3.1 here. Recall that the theorem concerns the recursion 

$$G_{t+1} = (1 - \alpha_t)\, G_t + \alpha_t\, h(Y_t, S_t, A_t, S_{t+1}),$$

where $Y_t = (e_t, F_t)$, and the function $h$ is Lipschitz continuous in $y$: for some constant $L_h$, 

$$\|h(y, s, a, s') - h(\hat{y}, s, a, s')\| \le L_h \|y - \hat{y}\|, \quad \forall\, y, \hat{y} \in \mathbb{R}^{n+1}, \ \forall\, (s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}.$$

Theorem 3.1 ($L^1$-convergence of $\{G_t\}$) Let $h$ be a vector-valued function satisfying the Lipschitz 
condition (3.1), and let $\{G_t\}$ be defined by the recursion (3.2), using the process $\{Z_t\}$. Then under 
Assumptions 2.1, 2.3, there exists a constant vector $G^*$ (independent of the stepsizes) such that for 
any given initial $Y_0 = (e_0, F_0)$ and $G_0$, $\lim_{t \to \infty} \mathbb{E}\big[ \|G_t - G^*\| \big] = 0$. 

Proof The proof proceeds in three steps: 

(i) For each $K \ge 1$, we consider the truncated traces $Y_{t,K} = (e_{t,K}, F_{t,K})$, $t = 0, 1, \ldots$, defined by 
Eqs. (A.13)-(A.15). Correspondingly, we define iterates $G_{0,K} = G_0$ and 

$$G_{t+1,K} = (1 - \alpha_t)\, G_{t,K} + \alpha_t\, h(Y_{t,K}, S_t, A_t, S_{t+1}).$$

For each $t$, $Y_{t,K}$ is a function of $(S_{t-2K}, A_{t-2K}, \ldots, S_t)$, so $h(Y_{t,K}, S_t, A_t, S_{t+1})$ can be viewed 
as a function of $X_t = (S_{t-2K}, A_{t-2K}, \ldots, S_{t+1})$, where $\{X_t\}$ is a finite state Markov chain with 
a single recurrent class by Assumption 2.1(ii). Then, with $\mathbb{E}_0$ denoting the expectation under the 
stationary distribution of the Markov chain $\{(S_t, A_t)\}$, we have, by a result from stochastic 
approximation theory (Borkar, 2008, Chap. 6, Theorem 7 and Cor. 8), that under Assumption 2.3 on the 
stepsizes, 

$$G_{t,K} \overset{a.s.}{\to} G_K^*, \quad \text{where } G_K^* = \mathbb{E}_0\big[ h(Y_{k,K}, S_k, A_k, S_{k+1}) \big] \quad \forall\, k > 2K. \tag{A.17}$$

Clearly, the vector $G_K^*$ does not depend on the initial condition $(Y_0, G_0)$ and the stepsizes $\{\alpha_t\}$. 
Since for all $t$, $\|G_{t,K}\| \le L$ for some constant $L < \infty$, we also have by the bounded convergence 
theorem 

$$\lim_{t \to \infty} \mathbb{E}\big[ \|G_{t,K} - G_K^*\| \big] = 0. \tag{A.18}$$

(ii) We show that as $K \to \infty$, $G_K^*$ converges to some vector $G^*$. For any $K' > K$, using the 
Lipschitz property of $h$ and Prop. A.3(ii), we have that for $k > 2K'$, 

$$\|G_{K'}^* - G_K^*\| = \big\| \mathbb{E}_0\big[ h(Y_{k,K'}, S_k, A_k, S_{k+1}) - h(Y_{k,K}, S_k, A_k, S_{k+1}) \big] \big\| \le L_h\, \mathbb{E}_0\big[ \|Y_{k,K'} - Y_{k,K}\| \big] \le L_h L_K,$$

where $L_K$ is some constant with $L_K \downarrow 0$ as $K \to \infty$. This shows that $\{G_K^*\}$ is a Cauchy sequence 
and hence converges to some $G^*$. 

(iii) We establish the theorem by bounding the differences between $G_t$ and $G_{t,K}$ for an increasing 
$K$. For each $K$, 

$$\limsup_{t \to \infty} \mathbb{E}\big[ \|G_t - G^*\| \big] \le \limsup_{t \to \infty} \mathbb{E}\big[ \|G_t - G_{t,K}\| \big] + \limsup_{t \to \infty} \mathbb{E}\big[ \|G_{t,K} - G_K^*\| \big] + \|G_K^* - G^*\|.$$

In the right-hand side, the second term equals 0 by Eq. (A.18), and the last term converges to 0 as 
$K \to \infty$, as we just showed in step (ii). Consider now the first term. Since 

$$G_{t+1} - G_{t+1,K} = (1 - \alpha_t)(G_t - G_{t,K}) + \alpha_t \big( h(Y_t, S_t, A_t, S_{t+1}) - h(Y_{t,K}, S_t, A_t, S_{t+1}) \big)$$

and $\|h(Y_t, S_t, A_t, S_{t+1}) - h(Y_{t,K}, S_t, A_t, S_{t+1})\| \le L_h \|Y_t - Y_{t,K}\|$ by the Lipschitz property of $h$, 
we have 

$$\begin{aligned}
\mathbb{E}\big[ \|G_{t+1} - G_{t+1,K}\| \big] &\le (1 - \alpha_t)\, \mathbb{E}\big[ \|G_t - G_{t,K}\| \big] + \alpha_t L_h\, \mathbb{E}\big[ \|Y_t - Y_{t,K}\| \big] \\
&\le (1 - \alpha_t)\, \mathbb{E}\big[ \|G_t - G_{t,K}\| \big] + \alpha_t L_h L_K,
\end{aligned} \tag{A.19}$$

where the second inequality follows from Prop. A.3(i), which gives the constants $L_K$, $K \ge 1$, with 
$L_K \downarrow 0$. For each $K$, in view of Assumption 2.3 on the stepsize, the inequality (A.19) implies that 

$$\limsup_{t \to \infty} \mathbb{E}\big[ \|G_t - G_{t,K}\| \big] \le L_h L_K.$$

Then, since $L_K \downarrow 0$, letting $K$ go to infinity in the right-hand side of the preceding inequality, it 
follows that $\lim_{t \to \infty} \mathbb{E}\big[ \|G_t - G^*\| \big] = 0$. ■ 
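
The passage from the recursive inequality (A.19) to the bound on the limit superior is a standard step, sketched here for completeness. The sketch assumes that the stepsize condition includes $\alpha_t \in (0, 1]$ and $\sum_t \alpha_t = \infty$; the symbols $x_t$, $c$ and $u_t$ are introduced only for this illustration. Write $x_t = \mathbb{E}[\|G_t - G_{t,K}\|]$ and $c = L_h L_K$, so that (A.19) reads $x_{t+1} \le (1 - \alpha_t)\, x_t + \alpha_t c$. Subtracting $c$ from both sides and setting $u_t = \max\{x_t - c, 0\}$ gives

$$u_{t+1} \le (1 - \alpha_t)\, u_t \le \prod_{k=0}^{t} (1 - \alpha_k)\, u_0 \le \exp\Big( -\sum_{k=0}^{t} \alpha_k \Big)\, u_0 \to 0,$$

and hence $\limsup_{t \to \infty} x_t \le c$.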


A.4. Handling Noisy Rewards: Proof of Prop. 3.1 

For convenience, we restate Prop. 3.1 below. For each $t \ge 0$, let $\omega_{t+1} = R_t - r(S_t, A_t, S_{t+1})$, the 
noise in the observed reward $R_t$. We consider the recursion (3.4): $W_0 = 0$ and 

$$W_{t+1} = (1 - \alpha_t)\, W_t + \alpha_t\, e_t\, \rho_t \cdot \omega_{t+1}, \quad t \ge 0. \tag{A.20}$$

Recall that it is assumed in our MDP model that $R_t$ has mean $r(S_t, A_t, S_{t+1})$ and bounded variance; 
specifically, let $\mathcal{F}_t = \sigma(S_0, A_0, \ldots, S_{t+1})$ in what follows, and we have that for some constant 
$L < \infty$, 

$$\mathbb{E}[\omega_{t+1} \mid \mathcal{F}_t] = 0, \qquad \mathbb{E}[\omega_{t+1}^2 \mid \mathcal{F}_t] \le L. \tag{A.21}$$

Proposition 3.1 (Effects of noise in random rewards) Under Assumptions 2.1, 2.3, for any given 
initial $(e_0, F_0)$, we have (i) $\mathbb{E}[\|W_t\|] \to 0$; and (ii) if, in addition, the stepsize is $\alpha_t = 1/(t + 1)$, 
then $W_t \overset{a.s.}{\to} 0$. 

Because the proofs of part (i) and part (ii) use quite different arguments, we give them separately 
below. 

Proof of Prop. 3.1(i) To simplify notation, denote $\tilde{\omega}_{t+1} = \rho_t\, \omega_{t+1}$. Similarly to the proof of 
Theorem 3.1, we first consider for each $K \ge 1$, the truncated traces $\{e_{t,K} : t \ge 0\}$ given by Eq. (A.15), 
and we replace $\{e_t\}$ in the recursion (A.20) by $\{e_{t,K}\}$ to define iterates $W_{0,K} = 0$ and 

$$W_{t+1,K} = (1 - \alpha_t)\, W_{t,K} + \alpha_t\, e_{t,K} \cdot \tilde{\omega}_{t+1}, \quad t \ge 0.$$

Since the number of states and actions is finite, we can bound $\|e_{t,K}\|$ by some finite constant for all 
$t$. Then, using Eq. (A.21), we have that for all $t \ge 0$, 

$$\mathbb{E}\big[ e_{t,K} \cdot \tilde{\omega}_{t+1} \mid \mathcal{F}_t \big] = 0, \qquad \mathbb{E}\big[ \|e_{t,K}\|^2 \cdot \tilde{\omega}_{t+1}^2 \mid \mathcal{F}_t \big] \le L' \quad \text{for some constant } L' < \infty.$$

Under Assumption 2.3 on the stepsize $\{\alpha_t\}$, this implies by (Tsitsiklis, 1994, Lemma 1) that 
$W_{t,K} \overset{a.s.}{\to} 0$. 

Next we show $\lim_{t \to \infty} \mathbb{E}[\|W_{t,K}\|] = 0$. Since $\alpha_t \in (0, 1]$ for all $t$ and $W_{0,K} = 0$, for each 
$t \ge 0$, $W_{t+1,K}$ can be expressed as a convex combination of $e_{j,K}\, \tilde{\omega}_{j+1}$, $j \le t$, with coefficients $c_{t,j}$ 
(each $c_{t,j}$ is a function of $(\alpha_0, \ldots, \alpha_t)$). Consequently, $\|W_{t+1,K}\| \le \sum_{j=0}^{t} c_{t,j}\, \|e_{j,K}\| \cdot |\tilde{\omega}_{j+1}|$; and 
by the convexity of the function $x^2$, 

$$\|W_{t+1,K}\|^2 \le \sum_{j=0}^{t} c_{t,j}\, \|e_{j,K}\|^2 \cdot |\tilde{\omega}_{j+1}|^2.$$

As discussed earlier, the variance of $e_{j,K} \cdot \tilde{\omega}_{j+1}$ can be bounded uniformly for all $j$, so the preceding 
inequality implies that there exists some constant $L' < \infty$ with 

$$\mathbb{E}\big[ \|W_{t+1,K}\|^2 \big] \le L', \quad \forall\, t \ge 0. \tag{A.22}$$

This in turn implies that the sequence $\{\|W_{t,K}\|, \ t \ge 0\}$ is uniformly integrable (see e.g., Billingsley, 
1968, p. 32); i.e., 

$$\sup_{t \ge 0} \mathbb{E}\big[ \|W_{t,K}\| \cdot \mathbb{1}(\|W_{t,K}\| \ge a) \big] \to 0 \quad \text{as } a \to +\infty,$$

(where $\mathbb{1}(\|W_{t,K}\| \ge a)$ is the indicator for the event $\|W_{t,K}\| \ge a$). By (Neveu, 1975, Lemma IV-2-5, 
p. 66), every uniformly integrable sequence of random variables which converges almost surely 
also converges in $L^1$. Therefore, since $W_{t,K} \overset{a.s.}{\to} 0$ as proved earlier, we have $\lim_{t \to \infty} \mathbb{E}[\|W_{t,K}\|] = 0$. 

We now prove $\lim_{t \to \infty} \mathbb{E}[\|W_t\|] = 0$ similarly to the proof step (iii) for Theorem 3.1. For each 
$K \ge 1$, since $\lim_{t \to \infty} \mathbb{E}[\|W_{t,K}\|] = 0$, we have 

$$\limsup_{t \to \infty} \mathbb{E}\big[ \|W_t\| \big] \le \limsup_{t \to \infty} \mathbb{E}\big[ \|W_t - W_{t,K}\| \big] + \limsup_{t \to \infty} \mathbb{E}\big[ \|W_{t,K}\| \big] = \limsup_{t \to \infty} \mathbb{E}\big[ \|W_t - W_{t,K}\| \big].$$

Thus it is sufficient to prove that 

$$\lim_{K \to \infty} \limsup_{t \to \infty} \mathbb{E}\big[ \|W_t - W_{t,K}\| \big] = 0. \tag{A.23}$$

To this end, let us write 

$$W_{t+1} - W_{t+1,K} = (1 - \alpha_t)(W_t - W_{t,K}) + \alpha_t (e_t - e_{t,K}) \cdot \tilde{\omega}_{t+1}.$$

Since the number of states and actions is finite, Eq. (A.21) implies that for all $t$, $\mathbb{E}\big[ |\tilde{\omega}_{t+1}| \mid \mathcal{F}_t \big] \le L'$ 
for some constant $L' < \infty$. Consequently, 

$$\begin{aligned}
\mathbb{E}\big[ \|W_{t+1} - W_{t+1,K}\| \big] &\le (1 - \alpha_t)\, \mathbb{E}\big[ \|W_t - W_{t,K}\| \big] + \alpha_t L' \cdot \mathbb{E}\big[ \|e_t - e_{t,K}\| \big] \\
&\le (1 - \alpha_t)\, \mathbb{E}\big[ \|W_t - W_{t,K}\| \big] + \alpha_t L' \cdot L_K,
\end{aligned}$$

where the second inequality follows from Prop. A.3(i), and $L_K$, $K \ge 1$, are constants with the 
property that $L_K \downarrow 0$ as $K \to \infty$. By Assumption 2.3 on the stepsize, the preceding inequality 
implies that for each $K$, 

$$\limsup_{t \to \infty} \mathbb{E}\big[ \|W_t - W_{t,K}\| \big] \le L' \cdot L_K.$$

Letting $K$ go to infinity and using the fact $L_K \downarrow 0$, we obtain the desired equality (A.23), which 
implies $\lim_{t \to \infty} \mathbb{E}[\|W_t\|] = 0$ as discussed earlier. ■ 

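Proof of Prop. 3.1(ii) Note that with $\alpha_t = 1/(t + 1)$, the convergence $W_t \overset{a.s.}{\to} 0$ we want to prove 
is equivalent to the convergence of the time average, $\frac{1}{t+1} \sum_{k=0}^{t} e_k \cdot \rho_k\, \omega_{k+1} \overset{a.s.}{\to} 0$, where each term 
in the sum is a function of $(e_k, S_k, A_k, S_{k+1}, R_k)$: 

$$e_k \cdot \rho_k\, \omega_{k+1} = e_k \cdot \rho(S_k, A_k) \cdot \big( R_k - r(S_k, A_k, S_{k+1}) \big).$$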

By Theorem 3.2, the Markov chain {Zt} = {(S'*, At, et, Ft)} has a unique invariant probability 
measure (}. Consequently, the Markov chain {Z't} := {{St, At, et. Ft, St+i,Rt)} has a unique in¬ 
variant probability measure C', determined by ( together with the probabilities of the successor state 
St+i given {St, At) and the conditional distribution of the reward Rt given {St, At, St+i), which are 
specified by the MDP model. Let denote expectation with respect to the probability distribution 
of the stationary Markov chain {Z).} with the initial distribution being From Theorem 3.2(ii) and 
the relation between (}' and C, we have 

^C'[l|eo|| • Po |wi|] < L'E^[||eo||] < oo (A.24) 

for some constant L' < oo. Specifically, in the above, we obtain the first inequality by bounding the 
conditional expectation of jcjil, conditioned on (cq, Sq, Aq), by some finite constant [cf. Eq. (A.21)], 
and we then obtain the second inequality by applying Theorem 3.2(ii). 

Given the finite expectation in Eq. (A.24), we can apply to the stationary process {Z't} with 
initial distribution (}' a strong law of large numbers for stationary processes [(Doob, 1953, Chap. X, 
Theorem 2.1); see also (Meyn and Tweedie, 2009, Theorem 17.1.2)]. By this theorem, there exists 
a nonempty subset Di of the state space of {Z'^} such that: 

(i) Di has (^'-measure 1, and 

(ii) for each initial condition Zq = z' G Di, {Wt} converges a.s. (with respect to the probability 
measure induced by the initial condition Zg = z' for the process {Z^}). 

In view of the dependence relations between the variables $(S_t, A_t, S_{t+1}, R_t)$ given by the MDP
model, and also in view of the finiteness of the state-action space $\mathcal{S} \times \mathcal{A}$ and the irreducibility
property of the behavior policy $\pi^o$ (Assumption 2.1(ii)), the preceding properties of the set $D_1$
imply that there exists a nonempty subset $D_2$ of the space $\mathbb{R}^{n+1}$ (which is the space of $(e_t, F_t)$) such
that:

(i) $D_2$ has measure 1 with respect to the marginal on $\mathbb{R}^{n+1}$ of the invariant probability measure
$\zeta$, and

(ii) for each initial condition $(e_0, F_0, S_0) = (e, F, s) \in D_2 \times \mathcal{S}$, $\{W_t\}$ converges a.s.

But the limit of $\{W_t\}$ cannot differ from 0 by Prop. 3.1(i) proved earlier (since for the given initial
condition, $E[\|W_t\|] \to 0$ implies the existence of a subsequence of $\{W_t\}$ converging to 0 a.s.
(Dudley, 2002, Theorem 9.2.1)). Thus, we conclude that for each initial condition $(e, F, s) \in
D_2 \times \mathcal{S}$, $W_t \overset{a.s.}{\to} 0$.

Now to establish Prop. 3.1(ii), we only need to show that for any given initial condition $(e_0, F_0) =
(e, F) \notin D_2$, $W_t \overset{a.s.}{\to} 0$ as well. To prove this, let $s \in \mathcal{S}$ be an arbitrary given state. Consider
the sequence $\{(e_t, F_t, W_t)\}$ with $(e_0, F_0) = (e, F)$ and $S_0 = s$. Consider also a second
sequence $\{(\hat e_t, \hat F_t, \hat W_t)\}$ which is generated by the same recursion and the same trajectory of states,
actions and rewards that define the first sequence $\{(e_t, F_t, W_t)\}$, but with a pair of initial traces
$(\hat e_0, \hat F_0) = (\hat e, \hat F) \in D_2$, possibly different from $(e, F)$. By what we proved earlier, $\hat W_t \overset{a.s.}{\to} 0$. On
the other hand, by definition

$$W_{t+1} - \hat W_{t+1} = \frac{1}{t+1} \sum_{k=0}^{t} (e_k - \hat e_k) \cdot \rho_k \, \omega_{k+1}.$$

In the right-hand side, we have $e_k - \hat e_k \overset{a.s.}{\to} 0$ (as $k \to \infty$) by Prop. A.2, and we also have
$\frac{1}{t+1} \sum_{k=0}^{t} \rho_k \, \omega_{k+1} \overset{a.s.}{\to} 0$ by the properties of $\{\omega_t\}$ and (Tsitsiklis, 1994, Lemma 1).^17
Consequently, $W_{t+1} - \hat W_{t+1} \overset{a.s.}{\to} 0$; since $\hat W_t \overset{a.s.}{\to} 0$, this implies $W_t \overset{a.s.}{\to} 0$. The proof is now complete. ■


A.5. Proof of Theorem 2.1 on the Convergence of ELSTD($\lambda$)

The proof proceeds by calculating the limit $G^*$ in Theorem 3.1 for the two functions $h_1, h_2$ in
Eq. (3.3): with $y = (e, F) \in \mathbb{R}^{n+1}$,

$$h_1(y,s,a,s') = e \cdot \rho(s,a) \big(\gamma(s')\phi(s')^\top - \phi(s)^\top\big), \qquad h_2(y,s,a,s') = e \cdot \rho(s,a) \, r(s,a,s'), \qquad \text{(A.25)}$$

which are associated with the ELSTD($\lambda$) iterates $C_t, b_t$, respectively. Specifically, based on the
proof of Theorem 3.1, we first calculate for each $K$, the limit $G^*_K$ given in Eq. (A.17), which is
associated with the truncated traces $(e_{t,K}, F_{t,K})$. We then take $K$ to $\infty$ to get the expression of $G^*$
since $G^* = \lim_{K \to \infty} G^*_K$ as shown in the step (ii) of the proof of Theorem 3.1. The details of this
calculation are given below, and the subsequent Lemma A.4 establishes that

$$G^* = C \ \text{ for } h = h_1; \qquad G^* = b \ \text{ for } h = h_2. \qquad \text{(A.26)}$$

Let us give now the rest of the proof of Theorem 2.1, assuming for the moment that Eq. (A.26)
has been proved. Then, with $h = h_1$, Theorem 3.1 yields the $L^1$-convergence of $\{G_t\}$ to $C$, and
Theorem 3.3 yields $G_t \overset{a.s.}{\to} C$ for stepsizes $\alpha_t = 1/(t+1)$.

For the iterates $\{b_t\}$ [cf. Eq. (2.11)], we also need to take care of the noise in the rewards $R_t$, by
using Prop. 3.1. Specifically, with $W_0 = 0$, let

$$\omega_{t+1} = R_t - r(S_t, A_t, S_{t+1}), \qquad W_{t+1} = (1 - \alpha_t) W_t + \alpha_t e_t \rho_t \, \omega_{t+1}, \quad t \ge 0,$$

[cf. Eq. (3.4)]. By definition,

$$b_{t+1} = (1 - \alpha_t) b_t + \alpha_t e_t \rho_t R_t = (1 - \alpha_t) b_t + \alpha_t e_t \rho_t \big(r(S_t, A_t, S_{t+1}) + \omega_{t+1}\big),$$

so the iteration for $\{b_t\}$ can be equivalently expressed as

$$b_{t+1} = G_{t+1} + W_{t+1},$$

where $G_{t+1}$ is given by the recursion (3.2) with $h = h_2$ and $G_0 = b_0$, and $W_{t+1}$ is as defined above.
Then by Theorem 3.1, Eq. (A.26) and Prop. 3.1(i), we have

$$\lim_{t \to \infty} E\big[\|b_t - b\|\big] \le \lim_{t \to \infty} E\big[\|G_t - G^*\|\big] + \lim_{t \to \infty} E\big[\|W_t\|\big] = 0.$$

This proves the $L^1$-convergence of $\{b_t\}$ to $b$. Similarly, its a.s. convergence in the second part of
Theorem 2.1 follows from Theorem 3.3, Eq. (A.26) and Prop. 3.1(ii) as

$$G_t \overset{a.s.}{\to} G^* = b \ \text{ and } \ W_t \overset{a.s.}{\to} 0 \quad \Longrightarrow \quad b_t = G_t + W_t \overset{a.s.}{\to} b.$$

Thus Theorem 2.1 is proved.

17. Specifically, we write $X_{t+1} := \frac{1}{t+1} \sum_{k=0}^{t} \rho_k \omega_{k+1}$ equivalently as the recursion,

$$X_{t+1} = \Big(1 - \frac{1}{t+1}\Big) X_t + \frac{1}{t+1} \rho_t \omega_{t+1} \quad \text{with } X_0 = 0,$$

and apply (Tsitsiklis, 1994, Lemma 1) to obtain $X_t \overset{a.s.}{\to} 0$. To apply the latter lemma, observe that since $\mathcal{S}$ and $\mathcal{A}$ are
finite spaces, $\rho_t$ is bounded for all $t$, and the weighted noise variables $\{\rho_t \omega_{t+1}\}$ thus have conditional zero mean and
uniformly bounded variances, conditioned on $\mathcal{F}_t$, similar to the properties of $\{\omega_t\}$ shown in Eq. (A.21). Then, the
conditions of (Tsitsiklis, 1994, Lemma 1) are satisfied and its conclusion applies.
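As a small illustration of the recursions used in this proof: with stepsizes $\alpha_t = 1/(t+1)$ and a zero initial value, an iteration of the form $x_{t+1} = (1 - \alpha_t) x_t + \alpha_t y_t$ is simply the running average of $y_0, \ldots, y_t$, which is why strong-law arguments apply to it. The Python sketch below checks this on made-up numbers; the function name `averaging_recursion` and the sample values are illustrative assumptions, not part of the paper.

```python
def averaging_recursion(samples):
    """Run x_{t+1} = (1 - a_t) x_t + a_t y_t with a_t = 1/(t+1) and x_0 = 0."""
    x = 0.0
    for t, y in enumerate(samples):
        alpha = 1.0 / (t + 1)          # stepsize used in the text
        x = (1.0 - alpha) * x + alpha * y
    return x


# Hypothetical scalar "noise terms"; any finite list works.
samples = [2.0, -1.0, 4.0, 0.5, 3.5]
recursive_value = averaging_recursion(samples)
sample_mean = sum(samples) / len(samples)
print(recursive_value, sample_mean)
```

The two printed values agree up to floating-point rounding, since the first step uses $\alpha_0 = 1$ and therefore discards the initial value.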

In the rest of this subsection, we verify Eq. (A.26), which we used in the proof above.

Computing the Limiting Matrix and Vector for ELSTD($\lambda$)

The desired limits for ELSTD($\lambda$) are the matrix $C$ and vector $b$ given in Eqs. (2.6)-(2.9), Section 2:

$$C = -\Phi^\top M \, (I - P_\pi \Gamma \Lambda)^{-1} (I - P_\pi \Gamma) \, \Phi, \qquad \text{(A.27)}$$

$$b = \Phi^\top M \, (I - P_\pi \Gamma \Lambda)^{-1} r_\pi, \qquad \text{(A.28)}$$

where $M$ is a diagonal matrix with

$$\text{diag}(M) = d_{\pi^o,i}^\top \, (I - P^\lambda_{\pi,\gamma})^{-1}, \qquad d_{\pi^o,i}(s) = d_{\pi^o}(s) \cdot i(s), \quad s \in \mathcal{S},$$

and $d_{\pi^o}(s)$ is the steady state probability of state $s$ under the behavior policy $\pi^o$ (cf. Assumption 2.1(ii)),
and $i(s)$ is the “interest” weight for state $s$. By the definition (2.6) of $P^\lambda_{\pi,\gamma}$, we can also
write

$$\text{diag}(M) = d_{\pi^o,i}^\top \, (I - P_\pi \Gamma)^{-1} (I - P_\pi \Gamma \Lambda). \qquad \text{(A.29)}$$

Recall that $E_\zeta$ denotes expectation with respect to the stationary Markov chain $\{Z_t\}$, where
$Z_t = (S_t, A_t, e_t, F_t)$, with its unique invariant probability measure $\zeta$ as the initial distribution
(cf. Theorem 3.2). We denote $Y_t = (e_t, F_t)$.

Lemma A.4 Under Assumption 2.1,

$$E_\zeta\big[h_1(Y_0, S_0, A_0, S_1)\big] = C, \qquad E_\zeta\big[h_2(Y_0, S_0, A_0, S_1)\big] = b.$$

Proof The proof proceeds as follows. We first calculate the limit vector $G^*_K$ defined in Eq. (A.17) in
the proof of Theorem 3.1, for the two choices of the function $h$ given in Eq. (A.25): $h = h_1, h_2$.
We then calculate $G^*$ by its definition given in the proof of Theorem 3.1: $G^* = \lim_{K \to \infty} G^*_K$.
Theorem 3.3 shows $E_\zeta[h(Y_0, S_0, A_0, S_1)] = G^*$, so the lemma follows if we prove that $G^* = C$
for $h = h_1$, and $G^* = b$ for $h = h_2$.

Let $E_0$ denote expectation with respect to the probability measure of the stationary Markov
chain $\{(S_t, A_t)\}$. Recall that for each $K \ge 1$, $G^*_K$ is defined by Eq. (A.17) as

$$G^*_K = E_0\big[h(Y_{t,K}, S_t, A_t, S_{t+1})\big], \quad \forall \, t \ge 2K, \qquad \text{(A.30)}$$

where $Y_{t,K} = (e_{t,K}, F_{t,K})$ are the truncated traces defined together with $M_{t,K}$ in Eqs. (A.13)-(A.15).
To simplify the calculation of $G^*_K$, we first calculate $E_0[M_{t,K} \Psi(S_t)]$ for any given $t \ge K$
and any matrix or vector-valued function $\Psi$ on $\mathcal{S}$.

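By Eqs. (A.13)-(A.14),

$$M_{t,K} = \lambda_t \, i(S_t) + (1 - \lambda_t) F_{t,K}, \qquad F_{t,K} = \sum_{k=t-K}^{t} i(S_k) \cdot (\rho_k \gamma_{k+1} \cdots \rho_{t-1} \gamma_t).$$

Thus, $M_{t,K}$ can be equivalently expressed as

$$M_{t,K} = i(S_t) + \sum_{k=1}^{K} i(S_{t-k}) \cdot (\rho_{t-k} \gamma_{t-k+1} \cdots \rho_{t-1} \gamma_t) \cdot (1 - \lambda_t).$$

To calculate $E_0[M_{t,K} \Psi(S_t)]$, we calculate the expectation for each term in the above summation
separately.

In what follows, for each $s \in \mathcal{S}$, let $\mathbb{1}_s(\cdot)$ denote the indicator for state $s$. For an expression $H$
that results in an $N$-dimensional vector, we write $(H)(s)$ for the $s$-th entry of the resulting vector.
Under Assumption 2.1(ii), we have

$$E_0\big[i(S_t) \cdot \mathbb{1}_s(S_t)\big] = d_{\pi^o}(s) \, i(s) = d_{\pi^o,i}(s),$$

and for $k = 1, 2, \ldots, K$,

$$E_0\big[i(S_{t-k}) \cdot (\rho_{t-k} \gamma_{t-k+1} \cdots \rho_{t-1} \gamma_t) \cdot (1 - \lambda_t) \cdot \mathbb{1}_s(S_t)\big] = \big(d_{\pi^o,i}^\top (P_\pi \Gamma)^k (I - \Lambda)\big)(s).$$

Hence

$$E_0\big[M_{t,K} \cdot \mathbb{1}_s(S_t)\big] = \Big(d_{\pi^o,i}^\top \Big(I + \sum_{k=1}^{K} (P_\pi \Gamma)^k (I - \Lambda)\Big)\Big)(s),$$

and consequently,

$$E_0\big[M_{t,K} \cdot \Psi(S_t)\big] = \sum_{s \in \mathcal{S}} E_0\big[M_{t,K} \cdot \mathbb{1}_s(S_t)\big] \cdot \Psi(s)
= \sum_{s \in \mathcal{S}} \Big(d_{\pi^o,i}^\top \Big(I + \sum_{k=1}^{K} (P_\pi \Gamma)^k (I - \Lambda)\Big)\Big)(s) \cdot \Psi(s). \qquad \text{(A.31)}$$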

Let us now calculate $G^*_K$ for $h = h_1$ or $h_2$ simultaneously, using Eq. (A.30) and the expressions
of $h_1, h_2$ given in Eq. (A.25). Let $t \ge 2K$, and let $\mathcal{F}_t = \sigma(S_0, A_0, \ldots, S_t)$. For the term appearing
after $e$ in the expression of $h_1$, we have

$$E_0\big[\rho_t \big(\gamma_{t+1} \phi(S_{t+1})^\top - \phi(S_t)^\top\big) \mid \mathcal{F}_t\big] = \Psi_1(S_t),$$

where $\Psi_1$ maps each $s$ to the $s$-th row of the matrix $(P_\pi \Gamma - I)\Phi$. For the term appearing after $e$ in
the expression of $h_2$, we have

$$E_0\big[\rho_t \, r(S_t, A_t, S_{t+1}) \mid \mathcal{F}_t\big] = \Psi_2(S_t),$$

where $\Psi_2$ maps each $s$ to the $s$-th entry of the vector $r_\pi$. Denote $\Psi_1^o = (P_\pi \Gamma - I)\Phi$, $\Psi_2^o = r_\pi$.
Corresponding to $h = h_1$ or $h_2$, let $\Psi = \Psi_1$ or $\Psi_2$, and let $\Psi^o = \Psi_1^o$ or $\Psi_2^o$, respectively. Then,
by Eqs. (A.30) and (A.25), we have

$$G^*_K = E_0\big[E_0[h(Y_{t,K}, S_t, A_t, S_{t+1}) \mid \mathcal{F}_t]\big]$$
$$= E_0\big[e_{t,K} \cdot \Psi(S_t)\big]$$
$$= E_0\Big[\sum_{k=t-K}^{t} M_{k,K} \cdot \phi(S_k) \cdot (\beta_{k+1} \cdots \beta_t) \cdot \Psi(S_t)\Big] \qquad \text{(A.32)}$$
$$= \sum_{k=t-K}^{t} E_0\Big[M_{k,K} \cdot \phi(S_k) \cdot E_0\big[(\beta_{k+1} \cdots \beta_t) \cdot \Psi(S_t) \mid \mathcal{F}_k\big]\Big]$$
$$= \sum_{k=t-K}^{t} E_0\Big[M_{k,K} \cdot \phi(S_k) \cdot \big((P_\pi \Gamma \Lambda)^{t-k} \Psi^o\big)_r(S_k)\Big]. \qquad \text{(A.33)}$$

In the above we used the definition (A.15) of $e_{t,K}$ in Eq. (A.32), and in Eq. (A.33), the term
$(\cdots)_r(S_k)$ inside the expectation denotes the $S_k$-th row of the matrix or vector given by the expression
inside the parentheses $(\cdots)$. We shall also use this notational convention in the proof
below.

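From Eq. (A.33), using the fact that the expectation is with respect to the stationary Markov
chain $\{(S_t, A_t)\}$, we obtain

$$G^*_K = \sum_{k=t-K}^{t} E_0\Big[M_{t,K} \cdot \phi(S_t) \cdot \big((P_\pi \Gamma \Lambda)^{t-k} \Psi^o\big)_r(S_t)\Big]
= E_0\Big[M_{t,K} \cdot \phi(S_t) \cdot \Big(\sum_{k=0}^{K} (P_\pi \Gamma \Lambda)^k \cdot \Psi^o\Big)_r(S_t)\Big]. \qquad \text{(A.34)}$$

Corresponding to the last two terms inside the expectation above, define a function $\Psi_K$ on $\mathcal{S}$ by

$$\Psi_K(s) = \phi(s) \cdot \Big(\sum_{k=0}^{K} (P_\pi \Gamma \Lambda)^k \cdot \Psi^o\Big)_r(s).$$

Then by combining Eq. (A.34) with (A.31), we have

$$G^*_K = E_0\big[M_{t,K} \cdot \Psi_K(S_t)\big] = \sum_{s \in \mathcal{S}} \Big(d_{\pi^o,i}^\top \Big(I + \sum_{k=1}^{K} (P_\pi \Gamma)^k (I - \Lambda)\Big)\Big)(s) \cdot \Psi_K(s). \qquad \text{(A.35)}$$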

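We now take $K$ to infinity to get the expression of $G^*$. For $h = h_1$, $\Psi^o$ in the definition of $\Psi_K$
is given by $\Psi^o = (P_\pi \Gamma - I)\Phi$. Using the equality relations,

$$I + \sum_{k=1}^{\infty} (P_\pi \Gamma)^k (I - \Lambda) = (I - P_\pi \Gamma)^{-1} (I - P_\pi \Gamma \Lambda),
\qquad \sum_{k=0}^{\infty} (P_\pi \Gamma \Lambda)^k = (I - P_\pi \Gamma \Lambda)^{-1},$$

we obtain from the expression (A.35) of $G^*_K$ that

$$G^* = \sum_{s \in \mathcal{S}} \big(d_{\pi^o,i}^\top (I - P_\pi \Gamma)^{-1} (I - P_\pi \Gamma \Lambda)\big)(s) \cdot \phi(s) \cdot \big((I - P_\pi \Gamma \Lambda)^{-1} \cdot (P_\pi \Gamma - I)\Phi\big)_r(s)$$
$$= \Phi^\top M \, (I - P_\pi \Gamma \Lambda)^{-1} (P_\pi \Gamma - I) \, \Phi = C,$$

where the second equality follows from the expression (A.29) for diag($M$), and the third equality
follows from Eq. (A.27). For $h = h_2$, $\Psi^o$ in the definition of $\Psi_K$ is given by $\Psi^o = r_\pi$, and a similar
calculation gives

$$G^* = \sum_{s \in \mathcal{S}} \big(d_{\pi^o,i}^\top (I - P_\pi \Gamma)^{-1} (I - P_\pi \Gamma \Lambda)\big)(s) \cdot \phi(s) \cdot \big((I - P_\pi \Gamma \Lambda)^{-1} \cdot r_\pi\big)_r(s)$$
$$= \Phi^\top M \, (I - P_\pi \Gamma \Lambda)^{-1} r_\pi = b,$$

where the last equality follows from Eq. (A.28). ■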
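The two series identities used in this proof can be sanity-checked numerically. The Python sketch below does so on a hypothetical two-state example: the matrices `P`, `Gamma` and `Lam` are made-up stand-ins for $P_\pi$, $\Gamma$ and $\Lambda$, and the helper functions are plain list-based matrix routines written only for this illustration. To avoid matrix inversion, each identity is checked after multiplying through by the matrix being inverted, using series truncated at a large `K`.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def msub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def diag(v):
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical model: transition matrix, discount factors, lambda-parameters.
P = [[0.9, 0.1], [0.2, 0.8]]
Gamma = diag([0.9, 0.5])
Lam = diag([0.3, 0.7])
I = eye(2)
PG = matmul(P, Gamma)      # stands in for P_pi Gamma
PGL = matmul(PG, Lam)      # stands in for P_pi Gamma Lambda

K = 400  # truncation level; the neglected tail is negligible for this example

# series1 = I + sum_{k=1}^K (PG)^k (I - Lam)
series1, power = eye(2), eye(2)
for _ in range(K):
    power = matmul(power, PG)
    series1 = madd(series1, matmul(power, msub(I, Lam)))

# series2 = sum_{k=0}^K (PGL)^k
series2, power = eye(2), eye(2)
for _ in range(K):
    power = matmul(power, PGL)
    series2 = madd(series2, power)

# First identity, multiplied on the left by (I - PG): should give I - PGL.
err1 = max_abs_diff(matmul(msub(I, PG), series1), msub(I, PGL))
# Second identity, multiplied on the left by (I - PGL): should give I.
err2 = max_abs_diff(matmul(msub(I, PGL), series2), I)
print(err1, err2)
```

Both residuals should be at the level of floating-point rounding, because the row sums of `PG` are strictly below one in this example and the truncated tails decay geometrically.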


A.6. Related Result: Alternative Proof of Existence of an Invariant Probability Measure 

Consider the Markov chain $\{Z_t\}$, where $Z_t = (S_t, A_t, e_t, F_t)$. In Section 3, we used, among others,
the weak Feller property of the Markov chain $\{Z_t\}$ to establish the existence of at least one invariant
probability measure for $\{Z_t\}$ (the property (iv) in Section 3.1). We now give an alternative proof
for this statement, by constructing directly an invariant probability measure. This proof is similar
to that of (Yu, 2012, Lemma 4.2), and it was motivated by an analysis of the LSTD(1) algorithm
by Meyn (2008, Chap. 11.5.2). The proof will also yield directly that under that invariant probability
measure, $\|(e_0, F_0)\|$ has a finite expectation, which was established in Theorem 3.2(ii) earlier by
using different arguments.

Proposition A.4 Under Assumption 2.1, the Markov chain $\{Z_t\}$ has at least one invariant probability
measure $\zeta$, with $E_\zeta[\|(e_0, F_0)\|] < \infty$.

Proof Consider a double-ended stationary Markov chain $\{(S_t, A_t) \mid -\infty < t < \infty\}$ with transition
probability matrix $P_{\pi^o}$ and probability distribution $P^o$. In this proof, let $E_0$ denote expectation
with respect to $P^o$. Let $X_t = ((S_t, A_t), (S_{t-1}, A_{t-1}), \ldots)$, and denote by $P_X$ the probability
distribution of $X_t$, which is a probability measure on $(\mathcal{S} \times \mathcal{A})^\infty$ and is the same for all $t$ due to
stationarity.

We will first define two functions, $f : (\mathcal{S} \times \mathcal{A})^\infty \to \mathbb{R}_+$ and $\psi : (\mathcal{S} \times \mathcal{A})^\infty \to \mathbb{R}^n$, which relate
to the traces $F$, $e$, respectively.^18 We will then show that the distribution of $(S_0, A_0, \psi(X_0), f(X_0))$
is an invariant probability measure of $\{Z_t\}$.

Let us introduce some notation. For $x \in (\mathcal{S} \times \mathcal{A})^\infty$, we index the components of $x$ as
$x = ((s_0, a_0), (s_{-1}, a_{-1}), \ldots)$, and we denote by $x^{(-k)}$ the tail of $x$ starting from $s_{-k}$, i.e.,
$x^{(-k)} = ((s_{-k}, a_{-k}), (s_{-k-1}, a_{-k-1}), \ldots)$. Recall that $\beta_t = \rho_{t-1} \gamma_t \lambda_t$. Correspondingly, we define
a function $\beta : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ by

$$\beta(s, a, s') = \rho(s, a) \, \gamma(s') \, \lambda(s').$$

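For an expression $H$ that results in a vector in $\mathbb{R}^N$, we write $(H)(s)$ for the $s$-th entry of that vector.
We now define the function $f$. Since

$$E_0\Big[\sum_{k=0}^{\infty} i(S_{-k}) \cdot (\rho_{-k} \gamma_{-k+1} \cdots \rho_{-1} \gamma_0)\Big]
= \sum_{k=0}^{\infty} E_0\big[i(S_{-k}) \cdot (\rho_{-k} \gamma_{-k+1} \cdots \rho_{-1} \gamma_0)\big]
= \sum_{k=0}^{\infty} d_{\pi^o,i}^\top (P_\pi \Gamma)^k \mathbf{1} < \infty, \qquad \text{(A.36)}$$

we can define a nonnegative real-valued measurable function $f$ on $(\mathcal{S} \times \mathcal{A})^\infty$ such that

$$f(x) = \begin{cases} \sum_{k=0}^{\infty} i(s_{-k}) \cdot \big(\rho(s_{-k}, a_{-k}) \gamma(s_{-k+1}) \cdots \rho(s_{-1}, a_{-1}) \gamma(s_0)\big), & \text{if } x \in D_1, \\ 0, & \text{otherwise}, \end{cases}$$

where $D_1$ is a subset of $(\mathcal{S} \times \mathcal{A})^\infty$ with $P_X(D_1) = 1$. By Eq. (A.36),

$$\int f(x) \, P_X(dx) = E_0[f(X_0)] = \sum_{k=0}^{\infty} E_0\big[i(S_{-k}) \cdot (\rho_{-k} \gamma_{-k+1} \cdots \rho_{-1} \gamma_0)\big] < \infty.$$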

We now define the function $\psi$. Define two constants $L'$, $L$ as follows. Let $L' = E_0[f(X_0)]$;
equivalently, $L' = E_0[f(X_{-k})]$ for all $k$ by stationarity. Let $L \ge \max\{i(s), \|\phi(s)\|\}$ for all $s \in \mathcal{S}$.
By taking conditional expectation similarly to the proof of Lemma A.2, we have the following
bound:

$$\sum_{k=0}^{\infty} E_0\Big[\big\| \big(\lambda_{-k} \cdot i(S_{-k}) + (1 - \lambda_{-k}) f(X_{-k})\big) \cdot \phi(S_{-k}) \cdot (\beta_{-k+1} \cdots \beta_0) \big\|\Big]$$
$$= \sum_{k=0}^{\infty} E_0\Big[\big(\lambda_{-k} \cdot i(S_{-k}) + (1 - \lambda_{-k}) f(X_{-k})\big) \cdot \|\phi(S_{-k})\| \cdot (\beta_{-k+1} \cdots \beta_0)\Big]$$

(p^rA)'=i < 00. 

fc =0 


Therefore, by a theorem on integration (Rudin, 1966, Theorem 1.38, p. 28-29), we can define a
measurable function $\psi$ on $(\mathcal{S} \times \mathcal{A})^\infty$ such that the following hold:

(i) on a set $D_2 \subset (\mathcal{S} \times \mathcal{A})^\infty$ with $P_X(D_2) = 1$,

$$\psi(x) = \sum_{k=0}^{\infty} \big(\lambda(s_{-k}) \, i(s_{-k}) + (1 - \lambda(s_{-k})) \, f(x^{(-k)})\big) \cdot \phi(s_{-k}) \cdot \big(\beta(s_{-k}, a_{-k}, s_{-k+1}) \cdots \beta(s_{-1}, a_{-1}, s_0)\big),$$

where the infinite series on the right-hand side converges to a vector in $\mathbb{R}^n$;

(ii) outside $D_2$, $\psi(x) = 0$; and

(iii) $\psi(x)$ is integrable with

$$E_0\big[\|\psi(X_0)\|\big] = \int \|\psi(x)\| \, P_X(dx) < \infty$$

and

$$\int \psi(x) \, P_X(dx) = \sum_{k=0}^{\infty} E_0\Big[\big(\lambda_{-k} \cdot i(S_{-k}) + (1 - \lambda_{-k}) f(X_{-k})\big) \cdot \phi(S_{-k}) \cdot (\beta_{-k+1} \cdots \beta_0)\Big].$$

18. We note that to gain intuition about the proof, it will be helpful to compare our definition of $f$ with the expression of
$F_t$ in Eq. (A.7), and compare our subsequent definition of $\psi$ with the expression of $e_t$ in Eq. (A.8).

Let $Y_0 = (\psi(X_0), f(X_0))$. We now show that the probability distribution of $(S_0, A_0, Y_0)$ is an
invariant probability measure of the Markov chain $\{Z_t\}$. To this end, consider $X_1 = ((S_1, A_1), X_0)$,
and let us define $Y_1^o = (e_1^o, F_1^o)$ based on $Y_0$ and $(S_0, A_0, S_1)$, using the same recursion that defines
$(e_t, F_t)$ [cf. Eqs. (2.2)-(2.4)]:

$$F_1^o = \gamma_1 \rho_0 \cdot f(X_0) + i(S_1), \qquad e_1^o = \beta_1 \, \psi(X_0) + \big(\lambda_1 \, i(S_1) + (1 - \lambda_1) F_1^o\big) \cdot \phi(S_1).$$

If $(S_0, A_0, Y_0)$ and $(S_1, A_1, Y_1^o)$ have the same distribution, then this distribution must be an
invariant probability measure of $\{Z_t\}$ because the stochastic kernel that governs the transition
from $(S_0, A_0, Y_0)$ to $(S_1, A_1, Y_1^o)$ is the same as that from $Z_0 = (S_0, A_0, (e_0, F_0))$ to $Z_1 = (S_1, A_1, (e_1, F_1))$.

Now define a set $D \subset (\mathcal{S} \times \mathcal{A})^\infty$ by $D = D_1 \cap D_2 \cap (\mathcal{S} \times \mathcal{A} \times (D_1 \cap D_2))$, where $D_1$, $D_2$ are
the sets in the definitions of the functions $f$ and $\psi$ respectively. Since $P_X(D_1) = P_X(D_2) = 1$,
we have

$$P_X(D) = P^o(X_1 \in D) = P^o(X_1 \in D_1 \cap D_2, \ X_0 \in D_1 \cap D_2) = 1.$$

Consider the case $X_1 \in D$. Then both $X_0, X_1 \in D_1 \cap D_2$. By the definition of $f$ on $D_1$, it follows
that

$$F_1^o = i(S_1) + \sum_{k=0}^{\infty} i(S_{-k}) \cdot (\rho_{-k} \gamma_{-k+1} \cdots \rho_{-1} \gamma_0) \cdot \rho_0 \gamma_1 = f(X_1),$$

and from this and the definition of $\psi$ on $D_2$, it also follows that

$$e_1^o = \big(\lambda_1 \, i(S_1) + (1 - \lambda_1) f(X_1)\big) \cdot \phi(S_1)
+ \sum_{k=0}^{\infty} \big(\lambda_{-k} \cdot i(S_{-k}) + (1 - \lambda_{-k}) f(X_{-k})\big) \cdot \phi(S_{-k}) \cdot (\beta_{-k+1} \cdots \beta_0) \cdot \beta_1
= \psi(X_1).$$

By stationarity, $(S_0, A_0, \psi(X_0), f(X_0))$ and $(S_1, A_1, \psi(X_1), f(X_1))$ have the same distribution.
Denote this distribution by $\zeta$. Since $(e_1^o, F_1^o)$ differs from $(\psi(X_1), f(X_1))$ only when $X_1 \notin D$,
an event with $P^o$-probability 0, we conclude that $(S_0, A_0, Y_0)$ and $(S_1, A_1, Y_1^o)$ have the same
distribution $\zeta$, which is an invariant probability measure of $\{Z_t\}$ as discussed earlier. Then, from
the integrability property of $\psi$ and $f$ shown earlier, we have

$$E_\zeta\big[\|(e_0, F_0)\|\big] \le E_\zeta\big[\|e_0\|\big] + E_\zeta[F_0] = E_0\big[\|\psi(X_0)\|\big] + E_0[f(X_0)] < \infty.$$

This completes the proof. ■




Appendix B. Proofs for Section 4 

In this appendix we prove Lemma 4.1 and Theorem 4.1 for the constrained ETD($\lambda$) algorithm (4.5).
We will restate both results for convenience.

Recall that the constrained ETD($\lambda$) calculates $\theta_t$, $t \ge 0$, all restricted to be in a closed ball with
radius $r$, $B = \{\theta \in \mathbb{R}^n \mid \|\theta\|_2 \le r\}$, according to

$$\theta_{t+1} = \Pi_B\big(\theta_t + \alpha_t \, h(\theta_t, \xi_t) + \alpha_t \, e_t \cdot \tilde\omega_{t+1}\big),$$

where $\tilde\omega_{t+1} = \rho_t (R_t - r(S_t, A_t, S_{t+1}))$ is noise, $\xi_t = (e_t, S_t, A_t, S_{t+1})$, and the function $h$ is given
by Eq. (4.3) as

$$h(\theta, \xi) = e \cdot \rho(s, a) \big(r(s, a, s') + \gamma(s') \phi(s')^\top \theta - \phi(s)^\top \theta\big), \quad \text{for } \xi = (e, s, a, s').$$

The “mean ODE” associated with this algorithm is the projected ODE (4.6):

$$\dot x = \bar h(x) + z, \qquad z \in -\mathcal{N}_B(x),$$

where $\bar h(x) = Cx + b$, $\mathcal{N}_B(x)$ is the normal cone of $B$ at $x$, and $z$ is the boundary reflection term
that keeps the solution in $B$ (Kushner and Yin, 2003). The solution of $\bar h(x) = 0$ is denoted $\theta^*$, i.e.,
$\theta^* = -C^{-1} b$.

Lemma 4.1 Let $c > 0$ be such that $x^\top C x \le -c \|x\|_2^2$ for all $x \in \mathbb{R}^n$. Suppose $B$ has a radius
$r > \|b\|_2 / c$. Then $\theta^*$ lies in the interior of $B$, and the only solution $x(t)$, $t \in (-\infty, +\infty)$, of the
projected ODE (4.6) in $B$ is $x(\cdot) \equiv \theta^*$.

Proof By the definition of $\theta^*$, $C\theta^* + b = 0$. Therefore,

$$0 = \langle \theta^*, C\theta^* + b \rangle = \langle \theta^*, C\theta^* \rangle + \langle \theta^*, b \rangle \le -c \|\theta^*\|_2^2 + \|b\|_2 \|\theta^*\|_2,$$

which implies $\|\theta^*\|_2 \le \|b\|_2 / c < r$, i.e., $\theta^*$ lies in the interior of $B$.

For a point $x$ on the boundary of $B$, $\|x\|_2 = r$ and the normal cone $\mathcal{N}_B(x) = \{a x \mid a \ge 0\}$.
Since $r > \|b\|_2 / c$, we have

$$\langle x, \bar h(x) \rangle = \langle x, Cx \rangle + \langle x, b \rangle \le -c \|x\|_2^2 + \|x\|_2 \|b\|_2 = r \, (-c r + \|b\|_2) < 0.$$

This shows that for any $x$ on the boundary of $B$, $\bar h(x)$ points inside $B$ and hence at $x$, the boundary
reflection term $z \in -\mathcal{N}_B(x)$ that keeps the solution in $B$ is the zero vector. Consequently, any
solution of the projected ODE (4.6) in $B$ is a solution of the ODE (4.4), which is $x(\cdot) \equiv \theta^*$. ■
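The two inequalities in this proof are easy to check on a concrete instance. The Python sketch below uses a hypothetical $2 \times 2$ matrix `C` and vector `b` (not taken from the paper) for which $x^\top C x = -2x_1^2 - 3x_2^2 \le -2\|x\|_2^2$, so that $c = 2$ is a valid constant. It checks that $\theta^* = -C^{-1}b$ satisfies the norm bound, that $\langle x, \bar h(x) \rangle < 0$ on a grid of boundary points, and that a simple projected Euler iteration (a deterministic stand-in for the mean ODE, chosen here only for illustration) settles at $\theta^*$.

```python
import math

# Hypothetical data: the off-diagonal entries cancel in x'Cx.
C = [[-2.0, 1.0], [-1.0, -3.0]]
b = [1.0, 2.0]
c = 2.0      # x'Cx = -2 x1^2 - 3 x2^2 <= -c ||x||^2
r = 1.5      # ball radius, chosen larger than ||b|| / c

def h_bar(x):
    """Mean field h_bar(x) = C x + b."""
    return [C[0][0] * x[0] + C[0][1] * x[1] + b[0],
            C[1][0] * x[0] + C[1][1] * x[1] + b[1]]

def norm(x):
    return math.hypot(x[0], x[1])

def project_ball(x, radius):
    """Euclidean projection onto the closed ball of the given radius."""
    n = norm(x)
    return list(x) if n <= radius else [radius * x[0] / n, radius * x[1] / n]

# theta* = -C^{-1} b, via the explicit 2x2 inverse.
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
theta_star = [-(C[1][1] * b[0] - C[0][1] * b[1]) / det,
              -(-C[1][0] * b[0] + C[0][0] * b[1]) / det]

# Largest value of <x, h_bar(x)> over a grid of boundary points.
boundary = ([r * math.cos(math.radians(j)), r * math.sin(math.radians(j))]
            for j in range(360))
worst_inner_product = max(
    sum(xi * hi for xi, hi in zip(x, h_bar(x))) for x in boundary)

# Projected Euler iteration started on the boundary of the ball.
theta = [r, 0.0]
for _ in range(200):
    step = h_bar(theta)
    theta = project_ball([theta[0] + 0.1 * step[0],
                          theta[1] + 0.1 * step[1]], r)
gap = norm([theta[0] - theta_star[0], theta[1] - theta_star[1]])

print(norm(theta_star), norm(b) / c, worst_inner_product, gap)
```

In this example the iteration is a contraction in the Euclidean norm and the projection is nonexpansive, so the final gap to $\theta^*$ is at rounding level.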

Next we prove Theorem 4.1.

Theorem 4.1 (Almost sure convergence of constrained ETD($\lambda$)) Let Assumptions 2.1-2.3 hold.
Let $\{\theta_t\}$ be the sequence generated by the constrained ETD($\lambda$) algorithm (4.5) with step sizes
satisfying $\alpha_t = O(1/t)$ and $\frac{\alpha_t - \alpha_{t+1}}{\alpha_t} = O(1/t)$, and with the radius $r$ of $B$ exceeding the threshold
given in Lemma 4.1. Then, for any given initial $(e_0, F_0, \theta_0)$, $\theta_t \overset{a.s.}{\to} \theta^*$.

Proof The desired conclusions will follow immediately from (Kushner and Yin, 2003, Theorem
6.1.1) and Lemma 4.1, if we can show that the conditions of (Kushner and Yin, 2003, Theorem
6.1.1) are met. Relevant here are the conditions A.6.1.1-A.6.1.4 and A.6.1.6-A.6.1.7 in (Kushner
and Yin, 2003, p. 165). We first adapt these six conditions to our problem, and by using stronger
forms of the conditions A.6.1.6-A.6.1.7 given in (Kushner and Yin, 2003, Eq. (6.1.10), p. 166), we
obtain the conditions (i)-(vi) below.

The first two conditions are for the functions $h$, $\bar h$ [cf. Eqs. (4.3), (4.4)] and the noise $\{\tilde\omega_t\}$:

(i) $\sup_{t \ge 0} E\big[\|h(\theta_t, \xi_t) + e_t \cdot \tilde\omega_{t+1}\|\big] < \infty$.

(ii) $\bar h(\theta)$ is continuous, and $h(\theta, \xi)$ is continuous in $\theta$ for each $\xi$.

Condition (i) is satisfied here. Indeed, we have $\sup_{t \ge 0} E[\|h(\theta_t, \xi_t)\|] < \infty$ in view of Prop. A.1,
the Lipschitz continuity of $h$ in $e$, and the fact that $\|\theta_t\|_2 \le r$ for all $t$ by the definition of the
constrained algorithm. Since the rewards $R_t$ have bounded variances by assumption and the noise
variable $\tilde\omega_{t+1} = \rho_t (R_t - r(S_t, A_t, S_{t+1}))$ by definition, we can bound $E[|\tilde\omega_{t+1}| \mid \mathcal{F}_t]$ by some
constant for all $t$, where $\mathcal{F}_t = \sigma(S_0, A_0, \ldots, S_{t+1})$, and consequently, we also have
$\sup_{t \ge 0} E[\|e_t \cdot \tilde\omega_{t+1}\|] < \infty$ by Prop. A.1. Hence condition (i) holds. Condition (ii) is also clearly satisfied here.

The four remaining conditions to be introduced are of the same type and relate to the asymptotic
rate of change conditions introduced by (Kushner and Clark, 1978). These conditions can guarantee
that the effects caused by the noises $\tilde\omega_{t+1}$ or by the discrepancies between $h$ and $\bar h$ asymptotically
“average out” so that the desired convergence can take place.

For any real $T' \ge 0$, define integer $m(T') = \min\{t \ge 0 \mid \sum_{k=0}^{t} \alpha_k > T'\}$. Conditions (iii)-(vi)
below are required to hold for each $\mu > 0$ and some $T > 0$ (here $\mu$ and $T$ are real numbers):

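(iii) For each $\theta$,

$$\lim_{t \to \infty} P\Big\{ \sup_{j \ge t} \max_{0 \le T' \le T} \Big\| \sum_{k=m(jT)}^{m(jT+T')-1} \alpha_k \big(h(\theta, \xi_k) - \bar h(\theta)\big) \Big\| \ge \mu \Big\} = 0. \qquad \text{(B.1)}$$

(iv)

$$\lim_{t \to \infty} P\Big\{ \sup_{j \ge t} \max_{0 \le T' \le T} \Big\| \sum_{k=m(jT)}^{m(jT+T')-1} \alpha_k \, e_k \cdot \tilde\omega_{k+1} \Big\| \ge \mu \Big\} = 0. \qquad \text{(B.2)}$$

(v) There exist nonnegative measurable functions $g_1(\theta)$, $g_2(\xi)$ such that

$$\|h(\theta, \xi)\| \le g_1(\theta) \, g_2(\xi),$$

where $g_1$ is bounded on each bounded set of $\theta$, and $g_2$ satisfies that $\sup_{t \ge 0} E[g_2(\xi_t)] < \infty$
and

$$\lim_{t \to \infty} P\Big\{ \sup_{j \ge t} \max_{0 \le T' \le T} \Big| \sum_{k=m(jT)}^{m(jT+T')-1} \alpha_k \big(g_2(\xi_k) - E[g_2(\xi_k)]\big) \Big| \ge \mu \Big\} = 0. \qquad \text{(B.3)}$$

(vi) There exist nonnegative measurable functions $g_3(\theta)$, $g_4(\xi)$ such that for each $\theta$, $\theta'$,

$$\|h(\theta, \xi) - h(\theta', \xi)\| \le g_3(\theta - \theta') \, g_4(\xi),$$

where $g_3$ is bounded on each bounded set of $\theta$, with $g_3(\theta) \to 0$ as $\theta \to 0$, and $g_4$ satisfies that
$\sup_{t \ge 0} E[g_4(\xi_t)] < \infty$ and

$$\lim_{t \to \infty} P\Big\{ \sup_{j \ge t} \max_{0 \le T' \le T} \Big| \sum_{k=m(jT)}^{m(jT+T')-1} \alpha_k \big(g_4(\xi_k) - E[g_4(\xi_k)]\big) \Big| \ge \mu \Big\} = 0. \qquad \text{(B.4)}$$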

One method given in (Kushner and Yin, 2003, Chap. 6.2, p. 170-171) of verifying the conditions
(B.1)-(B.4) above is to show that a strong law of large numbers holds for the processes involved. In
particular, let $\psi_k$ represent $h(\theta, \xi_k) - \bar h(\theta)$ for condition (iii), $e_k \cdot \tilde\omega_{k+1}$ for condition (iv),
$g_2(\xi_k) - E[g_2(\xi_k)]$ for condition (v), and $g_4(\xi_k) - E[g_4(\xi_k)]$ for condition (vi). If

$$\frac{1}{t+1} \sum_{k=0}^{t} \psi_k \overset{a.s.}{\to} 0 \qquad \text{(B.5)}$$

for the respective $\{\psi_k\}$, then the conditions (B.1)-(B.4) hold for stepsizes satisfying $\alpha_t = O(1/t)$
and $\frac{\alpha_t - \alpha_{t+1}}{\alpha_t} = O(1/t)$ (see Kushner and Yin, 2003, Example 6.1, p. 171).

We now apply the convergence results given earlier in this paper to show that the desired
convergence (B.5) holds for the processes involved in conditions (iii)-(vi). In particular, for each fixed
$\theta$, the almost sure convergence part of Theorem 2.1 implies that

$$\frac{1}{t+1} \sum_{k=0}^{t} h(\theta, \xi_k) \overset{a.s.}{\to} E_\zeta[h(\theta, \xi_0)] = \bar h(\theta).$$

Thus, condition (iii) holds, as just discussed. By Prop. 3.1(ii), $\frac{1}{t+1} \sum_{k=0}^{t} e_k \cdot \tilde\omega_{k+1} \overset{a.s.}{\to} 0$, so
condition (iv) is also met.

We verify now conditions (v)-(vi). For condition (v), we take $g_1(\theta) = \|\theta\| + 1$, and we bound
the function $h$ by

$$\|h(\theta, \xi)\| \le (\|\theta\| + 1) \, g_2(\xi), \quad \text{where } g_2(\xi) = L \|e\|,$$

and $L > 0$ is some constant. (This bound can be verified directly using the expression of $h$ and the
fact that the sets $\mathcal{S}$ and $\mathcal{A}$ are finite.) Similarly, for condition (vi), we take $g_3(\theta) = \|\theta\|$, and we
bound the change in $h(\theta, \xi)$ in terms of the change in $\theta$ as follows: for any $\theta, \theta' \in \mathbb{R}^n$,

$$\|h(\theta, \xi) - h(\theta', \xi)\| \le \|\theta - \theta'\| \, g_4(\xi), \quad \text{where } g_4(\xi) = L' \|e\|,$$

and $L' > 0$ is some constant. Now the functions $g_2, g_4$ are Lipschitz continuous in $e$. Hence, for
$j = 2, 4$, it follows from Prop. A.1 that $\sup_{t \ge 0} E[g_j(\xi_t)] < \infty$, and it follows from Theorems 3.3
and 3.1 that

$$\frac{1}{t+1} \sum_{k=0}^{t} g_j(\xi_k) \overset{a.s.}{\to} E_\zeta[g_j(\xi_0)], \quad \text{and} \quad \frac{1}{t+1} \sum_{k=0}^{t} E[g_j(\xi_k)] \to E_\zeta[g_j(\xi_0)], \quad \text{as } t \to \infty.$$

The preceding two relations imply the desired convergence:

$$\frac{1}{t+1} \sum_{k=0}^{t} \big(g_j(\xi_k) - E[g_j(\xi_k)]\big) \overset{a.s.}{\to} 0, \quad j = 2, 4.$$

This shows that conditions (v)-(vi) are met.

The theorem now follows by combining (Kushner and Yin, 2003, Theorem 6.1.1) with the
characterization of the solution of the projected ODE (4.6) given by Lemma 4.1, using the fact that under
Assumptions 2.1 and 2.2, the matrix $C$ is negative definite (Prop. C.2). ■


Appendix C. Negative Definiteness of the Matrix C 

In this appendix we prove a necessary and sufficient condition (Prop. C.2 below) for the matrix $C$
associated with ETD($\lambda$) to be negative definite. Recall from Eqs. (2.8)-(2.9) that

$$C = -\Phi^\top M (I - P^\lambda_{\pi,\gamma}) \, \Phi,$$

where $\Phi$ is the feature matrix with full column rank, $P^\lambda_{\pi,\gamma}$ is a substochastic matrix, and $M$ is a
nonnegative diagonal matrix with its diagonal, diag($M$), given by

$$\text{diag}(M) = d_{\pi^o,i}^\top (I - P^\lambda_{\pi,\gamma})^{-1}, \qquad d_{\pi^o,i} = \big(d_{\pi^o}(1) i(1), \ldots, d_{\pi^o}(N) i(N)\big).$$

Here Assumption 2.1 is in force and ensures that $(I - P^\lambda_{\pi,\gamma})^{-1}$ exists and $d_{\pi^o}(s) > 0$ for all $s \in \mathcal{S}$.

The negative definiteness of $C$ is important for the a.s. convergence of ETD($\lambda$). It is known
to hold if $i(s) > 0$ for all $s \in \mathcal{S}$ (Sutton et al., 2015), and it is also known that in general, $C$ is
always negative semidefinite for nonnegative $i(\cdot)$. Our result in this appendix will yield a stronger
conclusion: $C$ is negative definite whenever it is nonsingular.

In what follows, we first include a proof of the negative semidefiniteness of $C$ just mentioned,
for completeness (see Prop. C.1). We then give explicitly a condition on the approximation subspace
which we will prove to be equivalent to the nonsingularity/negative definiteness of $C$ (Prop. C.2).
We also show, by specializing this subspace condition, that if those states $s$ of interest (i.e., $i(s) > 0$)
are represented by features $\phi(s)$ that are rich enough, then $C$ can be made negative definite, without
knowledge of the model (see Cor. C.1, Remark C.2). In addition, we discuss the connection of
this subspace condition to seminorm projections, and show that when $C$ is nonsingular, the ETD($\lambda$)
solution can be viewed as the solution of a projected Bellman equation involving a seminorm
projection (see Remark C.1).

C.1. Preliminaries

First, recall that the matrix $C$ is said to be negative definite if there exists $c > 0$ such that

$$y^\top C y \le -c \|y\|_2^2, \quad \forall \, y \in \mathbb{R}^n,$$

and negative semidefinite if $c = 0$ in the preceding inequality. The negative definiteness of $C$ is
equivalent to that of the symmetric matrix

$$C + C^\top = -\Phi^\top \big[M (I - P^\lambda_{\pi,\gamma}) + (I - P^\lambda_{\pi,\gamma})^\top M\big] \Phi.$$

Similarly to (Sutton, 1988; Sutton et al., 2015), our analysis will focus on the $N \times N$ symmetric
matrix

$$G = M (I - Q) + (I - Q)^\top M$$

for the substochastic matrix $Q = P^\lambda_{\pi,\gamma}$ and the nonnegative diagonal matrix $M$ as given above. We
will use a theorem from (Varga, 2000, Cor. 1.22, p. 23), according to which a symmetric real matrix
with positive diagonal entries is positive definite if it is strictly diagonally dominant or irreducibly
diagonally dominant. Note that by definition, $G$ is irreducibly diagonally dominant if $G$ is
irreducible^19 and satisfies the following diagonally dominant conditions for every row of $G$, with strict
inequality holding for at least one row:

$$|G_{ss}| \ge \sum_{\bar s \ne s} |G_{s \bar s}|, \quad s = 1, \ldots, N,$$

whereas $G$ is strictly diagonally dominant if it satisfies the above inequalities strictly for all rows.

We now give a proof of the fact that $C$ is always negative semidefinite, as mentioned at the
beginning. This result is due to (Sutton et al., 2015).

Regarding notation, in the proofs below, for $v \in \mathbb{R}^N$, we write $v(s)$ for the $s$-th entry of $v$, and
for an expression $H$ that results in a vector in $\mathbb{R}^N$, we write $(H)(s)$ for the $s$-th entry of that vector.
For an expression $H$ that results in an $N \times N$ matrix, we write $[H]_{s \bar s}$ for its $(s, \bar s)$-th element. We
write 0 for a zero vector in any Euclidean space.
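The objects just introduced can be made concrete on a toy instance. The Python sketch below builds diag($M$) and $G$ for a hypothetical two-state substochastic matrix `Q` and a positive vector `d_i` standing in for $d_{\pi^o,i}$ (both made up for illustration), and evaluates the row-wise diagonal dominance margins $G_{ss} - \sum_{\bar s \ne s} |G_{s \bar s}|$. It also compares each margin with $M_{ss}(1 - \sum_{\bar s} Q_{s \bar s}) + d_{\pi^o,i}(s)$, an expression that follows by direct expansion of $G$. The row vector diag($M$) is obtained by iterating $m \leftarrow d_i + mQ$, a simple fixed-point scheme chosen here instead of an explicit matrix inverse.

```python
def emphasis_diagonal(Q, d_i, iters=2000):
    """Approximate the row vector m = d_i^T (I - Q)^{-1} by iterating m <- d_i + m Q."""
    n = len(Q)
    m = list(d_i)
    for _ in range(iters):
        m = [d_i[j] + sum(m[s] * Q[s][j] for s in range(n)) for j in range(n)]
    return m

def build_G(Q, m):
    """Form G = M (I - Q) + (M (I - Q))^T with M = diag(m)."""
    n = len(Q)
    A = [[m[s] * ((1.0 if s == j else 0.0) - Q[s][j]) for j in range(n)]
         for s in range(n)]
    return [[A[s][j] + A[j][s] for j in range(n)] for s in range(n)]

# Hypothetical inputs: row sums of Q are 0.8, and all entries of d_i are positive.
Q = [[0.5, 0.3], [0.2, 0.6]]
d_i = [0.4, 0.6]

m = emphasis_diagonal(Q, d_i)
G = build_G(Q, m)
n = len(Q)

# Row-wise margins of diagonal dominance.
margins = [G[s][s] - sum(abs(G[s][j]) for j in range(n) if j != s)
           for s in range(n)]
# Closed-form value of the same margins obtained by expanding G.
predicted = [m[s] * (1.0 - sum(Q[s])) + d_i[s] for s in range(n)]

print(m, margins, predicted)
```

For these inputs every margin is strictly positive, so this particular $G$ is strictly diagonally dominant; other choices of `Q` and `d_i` with zero interest weights can make some margins vanish.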

Proposition C.1 Let Assumption 2.1 hold. Then $C$ is always negative semidefinite, and it is negative
definite if $i(s) > 0$ for all $s \in \mathcal{S}$.

Proof We show that if $i(s) > 0$ for all $s \in \mathcal{S}$, then $G$ is strictly diagonally dominant, and hence
positive definite; and that if $i(s) \ge 0$ for all $s \in \mathcal{S}$, then $G$ is positive semidefinite. Since
$C + C^\top = -\Phi^\top G \, \Phi$, the conclusions about $C$ will then follow.

Let $\mathcal{J} = \{s \in \mathcal{S} \mid i(s) = 0\}$. Suppose $\mathcal{J} = \emptyset$. By definition $M_{ss} = \big(d_{\pi^o,i}^\top (I - Q)^{-1}\big)(s)$.
Using this together with the fact that $Q$ is substochastic, by a direct calculation as in (Sutton et al.,
2015), we have that for each $s \in \mathcal{S}$,

$$G_{ss} - \sum_{\bar s \ne s} |G_{s \bar s}| = M_{ss} \cdot \Big(1 - \sum_{\bar s = 1}^{N} Q_{s \bar s}\Big) + \sum_{\bar s = 1}^{N} M_{\bar s \bar s} \cdot [I - Q]_{\bar s s} \qquad \text{(C.1)}$$
$$\ge 0 + \big(d_{\pi^o,i}^\top (I - Q)^{-1} \cdot (I - Q)\big)(s)$$
$$= 0 + d_{\pi^o,i}(s) \qquad \text{(C.2)}$$
$$> 0,$$

where in the last strict inequality, we used the fact that $i(s) > 0$ implies $d_{\pi^o,i}(s) > 0$ under
Assumption 2.1(ii). This shows that $G$ is strictly diagonally dominant with positive diagonal entries,
and hence positive definite by (Varga, 2000, Cor. 1.22).

Consider now the case $\mathcal{J} \ne \emptyset$. For all $s \in \mathcal{J}$, perturb $i(s)$ to $\delta > 0$, and denote by $G_\delta$ the
matrix $G$ corresponding to the perturbed $i(\cdot)$. Then $G_\delta$ is positive definite by the preceding proof.
So for the original $G$, by continuity, $G = \lim_{\delta \to 0} G_\delta$ is positive semidefinite. ■

19. A symmetric matrix $G$ is irreducible if it corresponds to a connected (undirected) graph when the indices are viewed
as the nodes of the graph, and the nonzero entries of $G$ are viewed as edges of the graph.

C.2. Main Result

We now give the main result of this section. It expresses the negative definiteness condition on $C$
explicitly in terms of a condition on the approximation subspace $E$ (the column space of $\Phi$), and it
establishes the equivalence between the nonsingularity of $C$ and the negative definiteness of $C$.

Proposition C.2 Let Assumption 2.1 hold, and let $\mathcal{J}_0 = \{s \in \mathcal{S} \mid M_{ss} = 0\}$. Suppose the
approximation subspace $E \subset \mathbb{R}^N$ is such that

$$v \in E \ \text{ and } \ v(s) = 0, \ \forall \, s \notin \mathcal{J}_0 \quad \Longrightarrow \quad v = 0. \qquad \text{(C.3)}$$

Then the matrix $C$ is negative definite. Furthermore, $C$ is nonsingular if and only if the
condition (C.3) holds.

The corollary below gives a sufficient condition (C.4) for $C$ being negative definite, which can
be fulfilled without knowledge of the model, as we will elaborate in Remark C.2. This corollary
is a direct consequence of the preceding proposition, and follows from the observation that since
$i(s) > 0$ implies $M_{ss} > 0$, the condition (C.4) implies the condition (C.3) in Prop. C.2.

Corollary C.1 Let Assumption 2.1 hold, and let $\mathcal{J} = \{s \in \mathcal{S} \mid i(s) = 0\}$. Suppose the
approximation subspace $E \subset \mathbb{R}^N$ is such that

$$v \in E \ \text{ and } \ v(s) = 0, \ \forall \, s \notin \mathcal{J} \quad \Longrightarrow \quad v = 0. \qquad \text{(C.4)}$$

Then the matrix $C$ is negative definite.

We now proceed to prove Prop. C.2. Roughly speaking, the method of proof is to decompose
the matrix $G$ into irreducible diagonal blocks and use, among others, the theorem (Varga, 2000, Cor.
1.22, p. 23) on irreducibly diagonally dominant matrices mentioned earlier.

In the two technical lemmas that follow, we let the matrix $G$ and the nonnegative diagonal matrix
$M$ take a slightly more general form:

$$G = M (I - Q) + \big(M (I - Q)\big)^\top, \qquad \text{diag}(M) = d_{\pi^o,i}^\top (I - Q)^{-1},$$

where $Q$ is a substochastic matrix (not necessarily $P^\lambda_{\pi,\gamma}$) and $d_{\pi^o,i}$ is a nonnegative vector (for
notational simplicity, we keep using $d_{\pi^o,i}$ instead of introducing new notation).

Lemma C.1 Suppose the matrix $(I - Q)$ is invertible. Then the $s$-th diagonal entry $M_{ss} = 0$ if
and only if the $s$-th row and $s$-th column of $G$ contain all zeros.

Proof We have $G = M (I - Q) + (M (I - Q))^\top$. Suppose $s$ is a state with $M_{ss} \ne 0$. Then the $s$-th
row of the matrix $M (I - Q)$ is nonzero (because the $s$-th row of $I - Q$ is nonzero, given that
$(I - Q)^{-1}$ exists). The nonzero entries of this row cannot be canceled out by the corresponding entries
from the $s$-th row of $(M (I - Q))^\top$, because $Q$ is a substochastic matrix and $M$ is nonnegative.
Therefore, the $s$-th row of $G$ must also be nonzero. This proves the “if” part.

For the “only if” part, suppose $s$ is a state with $M_{ss} = 0$. Then the $s$-th row of the matrix
$M (I - Q)$ contains all zeros, so, since $G = M (I - Q) + (M (I - Q))^\top$ and is symmetric, to prove
the “only if” part, we only need to show that the $s$-th column of $M (I - Q)$ is a zero column. We
prove this by contradiction.

Suppose for some state $\bar s \ne s$, the $(\bar s, s)$-entry of the matrix $M (I - Q)$ is nonzero. Then using
the definition of $M$, this entry can be expressed as

$$M_{\bar s \bar s} \cdot [I - Q]_{\bar s s} = -\big(d_{\pi^o,i}^\top (I - Q)^{-1}\big)(\bar s) \cdot Q_{\bar s s} \ne 0,$$

which, in view of the equality $(I - Q)^{-1} = \sum_{k \ge 0} Q^k$ and the nonnegativity of $Q$, implies that

$$\big(d_{\pi^o,i}^\top Q^k\big)(\bar s) \cdot Q_{\bar s s} > 0 \quad \text{for some } k \ge 0.$$

This in turn implies that for the state $s$,

$$\big(d_{\pi^o,i}^\top Q^{k+1}\big)(s) > 0 \quad \text{for some } k \ge 0,$$

and hence

$$M_{ss} = \big(d_{\pi^o,i}^\top (I - Q)^{-1}\big)(s) \ge \big(d_{\pi^o,i}^\top Q^{k+1}\big)(s) > 0,$$

contradicting the assumption $M_{ss} = 0$. Thus the $s$-th column of $M (I - Q)$ must be a zero column. ■
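Lemma C.1 can be illustrated with a small example in which one state has zero emphasis. In the Python sketch below, `Q` and `d_i` are hypothetical: the second state has zero interest-weighted probability and cannot be reached from the first state under `Q`, so its diagonal entry of $M$ is zero. The code then checks, state by state, that $M_{ss} = 0$ coincides with the $s$-th row and column of $G$ being zero. As an implementation choice for this sketch, diag($M$) is computed by iterating $m \leftarrow d_i + mQ$ rather than by inverting $I - Q$.

```python
def emphasis_diagonal(Q, d_i, iters=2000):
    """Approximate the row vector m = d_i^T (I - Q)^{-1} by iterating m <- d_i + m Q."""
    n = len(Q)
    m = list(d_i)
    for _ in range(iters):
        m = [d_i[j] + sum(m[s] * Q[s][j] for s in range(n)) for j in range(n)]
    return m

def build_G(Q, m):
    """Form G = M (I - Q) + (M (I - Q))^T with M = diag(m)."""
    n = len(Q)
    A = [[m[s] * ((1.0 if s == j else 0.0) - Q[s][j]) for j in range(n)]
         for s in range(n)]
    return [[A[s][j] + A[j][s] for j in range(n)] for s in range(n)]

# Hypothetical inputs: state 1 has d_i = 0 and is unreachable from state 0.
Q = [[0.5, 0.0], [0.3, 0.2]]
d_i = [0.7, 0.0]

m = emphasis_diagonal(Q, d_i)
G = build_G(Q, m)
n = len(Q)
tol = 1e-12

zero_emphasis = [abs(m[s]) < tol for s in range(n)]
zero_row_and_column = [
    all(abs(G[s][j]) < tol and abs(G[j][s]) < tol for j in range(n))
    for s in range(n)
]
print(m, G, zero_emphasis, zero_row_and_column)
```

The two Boolean lists coincide for this example: the first state has positive emphasis and a nonzero row in $G$, while the second has zero emphasis and a zero row and column.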


Lemma C.2 Suppose that the matrix (/ — Q) is invertible and the matrix G is irreducible. Then 
the diagonal entries of M must be positive, and G is irreducibly diagonally dominant with positive 
diagonal entries, and hence positive definite. 

Proof. If $s$ is a state with $M_{ss} = 0$, by Lemma C.1, the $s$-th row and $s$-th column of $G$ would contain all zeros, which cannot happen if $G$ is irreducible. Thus $M_{ss} > 0$ for all $s \in \mathcal{S}$.

We have calculated in the proof of Prop. C.1 [cf. Eqs. (C.1)-(C.2)] that for nonnegative $i(\cdot)$,

$$G_{ss} - \sum_{\bar{s} \neq s} |G_{s\bar{s}}| = M_{ss} \cdot \Big(1 - \sum_{\bar{s}=1}^{N} Q_{s\bar{s}}\Big) + \sum_{\bar{s}=1}^{N} M_{\bar{s}\bar{s}} \cdot [I - Q]_{\bar{s}s} \ge 0$$

for all rows $s$. The strict inequality $G_{ss} - \sum_{\bar{s} \neq s} |G_{s\bar{s}}| > 0$ must hold for some $s$. To see this, note that the invertibility of $(I - Q)$ implies that $1 - \sum_{\bar{s}=1}^{N} Q_{s\bar{s}} > 0$ for some $s$, which together with $M_{ss} > 0$ implies that the first term in the right-hand side above, $M_{ss} \cdot \big(1 - \sum_{\bar{s}=1}^{N} Q_{s\bar{s}}\big)$, must be positive for at least one row $s$, whereas the second term in the right-hand side above equals $d_{\pi^o,i}(s) \ge 0$ [cf. Eqs. (C.1)-(C.2)]. Since $G$ is irreducible by assumption, this proves that $G$ is irreducibly diagonally dominant.

Finally, since $Q$ is substochastic and $(I - Q)^{-1}$ exists, the diagonals of $I - Q$ must be positive. The diagonals of $M$ are also positive, as proved earlier. Thus the diagonal entries $G_{ss} > 0$ for all rows $s$. It then follows from (Varga, 2000, Cor. 1.22) that $G$ is positive definite. ■
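
The diagonal-dominance identity used in this proof can be checked numerically. The sketch below uses a hypothetical irreducible substochastic `Q` and a nonnegative weight vector `d` (both assumed for illustration, with `d` standing in for $d_{\pi^o,i}$), and compares the dominance margin of each row of $G$ with the right-hand side $M_{ss}(1 - \sum_{\bar{s}} Q_{s\bar{s}}) + d_{\pi^o,i}(s)$.

```python
import numpy as np

# Hypothetical data: all entries of Q are positive, so G is irreducible.
Q = np.array([[0.2, 0.5, 0.2],
              [0.4, 0.1, 0.4],
              [0.3, 0.3, 0.3]])   # every row sums to 0.9
d = np.array([0.2, 0.0, 0.8])     # nonnegative; state 1 has zero weight

I = np.eye(3)
m = d @ np.linalg.inv(I - Q)      # diag(M) = d^T (I - Q)^{-1}
A = np.diag(m) @ (I - Q)
G = A + A.T

# Dominance margin: G_ss minus the sum of |G_{s sbar}| over sbar != s.
absG = np.abs(G)
margin = np.diag(G) - (absG.sum(axis=1) - np.diag(absG))

# Right-hand side of the identity; note that m @ (I - Q) recovers d.
rhs = m * (1.0 - Q.sum(axis=1)) + m @ (I - Q)

print("margin:", margin)
print("rhs:   ", rhs)
print("smallest eigenvalue of G:", np.linalg.eigvalsh(G).min())
```

In this example every diagonal entry of $M$ is positive even though `d` has a zero entry, because the zero-weight state is reachable from the others.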

We are now ready to prove Prop. C.2. Regarding notation, in the proof, if $G_1, G_2, \ldots, G_L$ are square matrices (which can have different sizes), we will write $\mathrm{diag}(G_1, G_2, \ldots, G_L)$ for the block-diagonal matrix that has $G_k$ as its $k$-th diagonal block. However, for a single square matrix $G_1$, we will keep using $\mathrm{diag}(G_1)$ to mean the diagonal of $G_1$.

Proof of Prop. C.2. By Assumption 2.1(i), $(I - P_\pi \Gamma)^{-1}$ exists, which implies that for the substochastic matrix $Q = P^\lambda_{\pi,\gamma}$ [cf. Eq. (2.6)], $(I - Q)^{-1}$ also exists. So the matrices $M$, $C$ and $G$ are well defined. By reordering the states if necessary, we can arrange $G$ into a block-diagonal matrix with $L$ blocks,

$$G = \mathrm{diag}\big(G^{(1)}, \ldots, G^{(L)}\big), \tag{C.5}$$

such that: 

(i) for each $\ell = 1, \ldots, L - 1$, the $\ell$-th block $G^{(\ell)}$ is irreducible; and

(ii) the $L$-th block $G^{(L)}$ is a zero matrix (if $G$ does not have a zero block, we will treat $G^{(L)}$ as a matrix of size zero, and this will not affect the proof below).

Note that by Lemma C.1, the row/column indices associated with the zero block $G^{(L)}$ are exactly those in the set

$$J_0 = \{ s \in \mathcal{S} \mid M_{ss} = 0 \}.$$

Since the condition (C.3) rules out the case $J_0 = \mathcal{S}$, $G$ cannot be a zero matrix, so it must have at least one irreducible block.

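We prove next that the matrix $Q$ has the following structure, matching the block-diagonal structure of $G$:

$$Q = \begin{bmatrix} Q^{(1)} & & & & \\ & Q^{(2)} & & & \\ & & \ddots & & \\ & & & Q^{(L-1)} & \\ * & * & \cdots & * & * \end{bmatrix}, \tag{C.6}$$

where the blocks $Q^{(\ell)}$, $\ell \le L - 1$, on the diagonal correspond to the blocks $G^{(\ell)}$, $\ell \le L - 1$, on the diagonal of $G$, the unmarked blocks contain all zeros, and the $*$-blocks can have both zeros and positive entries.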

To prove Eq. (C.6) by contradiction, suppose it does not hold. This means that there must exist two states $s \neq \bar{s}$ with $Q_{s\bar{s}} > 0$, but the entry $Q_{s\bar{s}}$ lies inside an unmarked block of the matrix on the right-hand side of Eq. (C.6). This position of $Q_{s\bar{s}}$ implies $G_{s\bar{s}} = 0$, which is possible only if $M_{ss} = 0$ (otherwise, $Q_{s\bar{s}} \neq 0$ would force $G_{s\bar{s}} \neq 0$). But if $M_{ss} = 0$, then $s \in J_0$, which is the set of indices associated with the last zero block, as shown earlier. Consequently, the entry $Q_{s\bar{s}}$ cannot lie inside an unmarked block as we assumed. This contradiction proves that Eq. (C.6) must hold.

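From the structure of $Q$ shown in (C.6), it follows that $(I - Q)^{-1}$ has the same structure:

$$(I - Q)^{-1} = \begin{bmatrix} (I - Q^{(1)})^{-1} & & & & \\ & (I - Q^{(2)})^{-1} & & & \\ & & \ddots & & \\ & & & (I - Q^{(L-1)})^{-1} & \\ * & * & \cdots & * & * \end{bmatrix}. \tag{C.7}$$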
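
The claim that inversion preserves the block pattern can be illustrated numerically. The following sketch assumes a hypothetical 5-state substochastic `Q` with two diagonal blocks and one last "star" row (values chosen only for illustration), and checks that the inverse of $I - Q$ has the same zero pattern, with diagonal blocks equal to $(I - Q^{(\ell)})^{-1}$.

```python
import numpy as np

# Hypothetical Q with the block pattern of (C.6): two diagonal blocks
# (states 0-1 and states 2-3) and a last row (state 4) that may be dense.
Q = np.array([
    [0.3, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.6, 0.0],
    [0.0, 0.0, 0.5, 0.3, 0.0],
    [0.1, 0.2, 0.1, 0.2, 0.3],
])
R = np.linalg.inv(np.eye(5) - Q)

blocks = [slice(0, 2), slice(2, 4)]

# Each diagonal block of the inverse equals the inverse of the matching block.
diag_ok = all(
    np.allclose(R[b, b], np.linalg.inv(np.eye(2) - Q[b, b])) for b in blocks
)

# Everything outside the diagonal blocks and the last row should stay zero.
mask = np.ones((5, 5), dtype=bool)
for b in blocks:
    mask[b, b] = False
mask[4, :] = False
zeros_ok = np.allclose(R[mask], 0.0)

print("diagonal blocks match:", diag_ok)
print("unmarked blocks are zero:", zeros_ok)
```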


Since $G = M(I - Q) + (I - Q)^\top M$, Eqs. (C.5), (C.6) and (C.7) together imply that for each $\ell \le L - 1$, the matrix $G^{(\ell)}$ can be expressed as

$$G^{(\ell)} = M^{(\ell)}\big(I - Q^{(\ell)}\big) + \big(I - Q^{(\ell)}\big)^\top M^{(\ell)},$$

where $M^{(\ell)}$ is the $\ell$-th diagonal block in the corresponding decomposition of $M$ as

$$M = \mathrm{diag}\big(M^{(1)}, \ldots, M^{(L)}\big),$$

and if we decompose the vector $d_{\pi^o,i}$ similarly as $\big(d^{(1)}_{\pi^o,i}, \ldots, d^{(L)}_{\pi^o,i}\big)$, then for each $\ell \le L - 1$, the diagonal block $M^{(\ell)}$ has its diagonal entries given by

$$\mathrm{diag}\big(M^{(\ell)}\big) = d^{(\ell)\top}_{\pi^o,i}\big(I - Q^{(\ell)}\big)^{-1}, \qquad \ell \le L - 1.$$

In the above expression, we also used the fact $d^{(L)}_{\pi^o,i} = 0$, which is implied by $M^{(L)}$ being a zero matrix (which we showed at the beginning of this proof; cf. footnote 20).

We now apply Lemma C.2 to each irreducible block $G^{(\ell)}$, $\ell \le L - 1$ (with $M = M^{(\ell)}$ and $Q = Q^{(\ell)}$, a substochastic matrix). This yields that each of these $G^{(\ell)}$ is positive definite, and consequently, the block-diagonal matrix

$$\hat{G} = \mathrm{diag}\big(G^{(1)}, \ldots, G^{(L-1)}\big)$$

is positive definite.

Finally, we prove the statement of the proposition. For the block-diagonal decomposition of $G$ as $G = \mathrm{diag}\big(\hat{G}, G^{(L)}\big)$, write a point $y \in \mathbb{R}^N$ correspondingly as $y = (y_1, y_0)$, i.e., the indices of the components of $y_0$ are those in $J_0 = \{ s \in \mathcal{S} \mid M_{ss} = 0 \}$, and the dimension of $y_1$ is

$$\hat{N} = N - |J_0|.$$

Since $\hat{G}$ is positive definite, there exists some $c > 0$ such that

$$y_1^\top \hat{G}\, y_1 \ge c\, \|y_1\|_2^2, \qquad \forall\, y_1 \in \mathbb{R}^{\hat{N}}. \tag{C.8}$$

Consider a point $y = (y_1, y_0) \in E$ with $y_1 = 0$. Then $y_0 = 0$ by the assumption (C.3). Since $E$ is a subspace, this implies that there exists some constant $\delta > 0$ such that

$$\inf_{y \in E,\ \|y\|_2 = 1} \|y_1\|_2 \ge \delta. \tag{C.9}$$

Using Eqs. (C.8)-(C.9), we have

$$\inf_{y \in E,\ \|y\|_2 = 1} y^\top G\, y = \inf_{y \in E,\ \|y\|_2 = 1} y_1^\top \hat{G}\, y_1 \ge \inf_{y \in E,\ \|y\|_2 = 1} c\, \|y_1\|_2^2 \ge c\, \delta^2 > 0. \tag{C.10}$$


Since $E$ is the column space of $\Phi$ and $\Phi$ has linearly independent columns by definition, the inequality (C.10) establishes that the matrix $\Phi^\top G \Phi = -C - C^\top$ is positive definite, and consequently, $C$ is negative definite.

The preceding proof also shows that $C$ is nonsingular if the condition (C.3) holds. To complete the proof, let us assume that the condition (C.3) does not hold and show that $C$ must be singular. We will use the structure of the matrix $M(I - Q)$ revealed in the preceding analysis to prove this. Decompose $\Phi$ into two blocks as

$$\Phi = \begin{bmatrix} \Phi_1 \\ \Phi_0 \end{bmatrix},$$

where $\Phi_0$ consists of those rows of $\Phi$ whose indices are in $J_0 = \{ s \in \mathcal{S} \mid M_{ss} = 0 \}$. Denote

$$\hat{M} = \mathrm{diag}\big(M^{(1)}, \ldots, M^{(L-1)}\big), \qquad \hat{Q} = \mathrm{diag}\big(Q^{(1)}, \ldots, Q^{(L-1)}\big),$$

which are the sub-matrices of $M$ and $Q$, respectively, obtained by deleting the rows/columns whose indices are in $J_0$. From the structure of the matrices $Q$, $(I - Q)^{-1}$ and the corresponding expression of $M^{(\ell)}$, $\ell \le L - 1$, that we showed earlier, it follows that the matrix $C = -\Phi^\top M (I - Q) \Phi$ is indeed given by

$$C = -\Phi_1^\top \hat{M} \big(I - \hat{Q}\big) \Phi_1.$$

Now if the condition (C.3) does not hold, then there exists $y = (y_1, y_0) \in E$ such that $y_1 = 0$ and $y_0 \neq 0$. Expressing $y$ in terms of $\Phi$, we have $y_1 = \Phi_1 x = 0$ and $y_0 = \Phi_0 x \neq 0$ for some nonzero $x \in \mathbb{R}^n$. This implies $\mathrm{rank}(\Phi_1) < n$, so using the preceding expression of $C$, we have $\mathrm{rank}(C) < n$ and hence $C$ is singular. ■

20. Using the expression $(I - Q)^{-1} = \sum_{k \ge 0} Q^k$, we see from the definition of $M_{ss}$ that $M_{ss} \ge d_{\pi^o,i}(s)$. Therefore, $M_{ss} = 0$ implies that $d_{\pi^o,i}(s) = 0$.
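
Both directions of the proposition can be seen on a small example. The sketch below is illustrative only: the matrix `Q`, the weights `d` (standing in for $d_{\pi^o,i}$), the two feature matrices, and the helper `build_C` are all hypothetical. State 2 has zero emphasis weight; `Phi_good` satisfies condition (C.3), while `Phi_bad` has linearly independent columns but violates it.

```python
import numpy as np

def build_C(Q, d, Phi):
    """Return (C, m), where m = diag(M) = d^T (I - Q)^{-1}
    and C = -Phi^T M (I - Q) Phi."""
    I = np.eye(Q.shape[0])
    m = d @ np.linalg.inv(I - Q)
    C = -Phi.T @ np.diag(m) @ (I - Q) @ Phi
    return C, m

# Hypothetical data: states 0 and 1 never reach state 2, and d(2) = 0,
# so the emphasis weight of state 2 is zero (J_0 = {2}).
Q = np.array([[0.5, 0.4, 0.0],
              [0.3, 0.5, 0.0],
              [0.2, 0.3, 0.4]])
d = np.array([0.5, 0.5, 0.0])

# (C.3) holds: the rows of states 0 and 1 already span R^2.
Phi_good = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# (C.3) fails: y = Phi_bad @ [0, 1] is nonzero only at state 2.
Phi_bad = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

C_good, m = build_C(Q, d, Phi_good)
C_bad, _ = build_C(Q, d, Phi_bad)

# Negative definiteness of C is tested through its symmetric part.
max_eig_good = np.linalg.eigvalsh(C_good + C_good.T).max()
# An explicit tolerance guards against rounding noise in the zero weight.
rank_bad = np.linalg.matrix_rank(C_bad, tol=1e-10)

print("largest eigenvalue of C + C^T (good features):", max_eig_good)
print("rank of C (bad features):", rank_bad)
```

With `Phi_good` the symmetric part of $C$ has only negative eigenvalues, whereas with `Phi_bad` the matrix $C$ loses rank, matching the argument above that $\mathrm{rank}(\Phi_1) < n$ forces $C$ to be singular.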

Finally, we make two remarks on the conditions (C.3) and (C.4) in Prop. C.2 and Cor. C.1.

Remark C.1 (Seminorm projection) The use of seminorm projections to formulate the projected Bellman equations associated with TD methods was introduced in (Yu and Bertsekas, 2012). There, conditions of the form (C.3) or (C.4) are used to define a projection on the approximation subspace with respect to a seminorm. We can use this formulation here to interpret the solution of ETD($\lambda$) and ELSTD($\lambda$). Specifically, define a weighted Euclidean seminorm $\|\cdot\|_M$ on $\mathbb{R}^N$ using $\mathrm{diag}(M)$ as the weights, as

$$\|y\|_M = \Big( \sum_{s \in \mathcal{S}} M_{ss}\, y(s)^2 \Big)^{1/2}.$$

Condition (C.3) ensures that the projection $\Pi_M$ onto $E$ with respect to the seminorm $\|\cdot\|_M$ is well-defined and has the matrix representation

$$\Pi_M = \Phi \big(\Phi^\top M \Phi\big)^{-1} \Phi^\top M$$

(cf. Yu and Bertsekas, 2012, Sec. 2.1). So by Prop. C.2 and the convergence results of this paper,
when $C$ is nonsingular, ETD($\lambda$) and ELSTD($\lambda$) solve in the limit the projected Bellman equation

$$v = \Pi_M \big( r^\lambda_{\pi,\gamma} + P^\lambda_{\pi,\gamma}\, v \big).$$

The relation between the solution $v = \Phi \theta^*$ of this equation and the desired value function $v_\pi$, in particular, the approximation error, can then be analyzed using the oblique projection viewpoint (Scherrer, 2010) (for details, see also (Yu and Bertsekas, 2012)).
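
To make the seminorm projection concrete, the sketch below assumes the standard weighted least-squares form $\Phi(\Phi^\top M \Phi)^{-1}\Phi^\top M$ for the projection matrix and uses hypothetical weights and features (one state has zero weight, and the features satisfy condition (C.3)). It checks that the matrix is idempotent, leaves the subspace $E$ fixed, and minimizes the seminorm distance to $E$.

```python
import numpy as np

# Hypothetical emphasis weights diag(M); state 2 has zero weight.
m = np.array([3.0, 3.5, 0.0])
M = np.diag(m)
# Hypothetical features; rows 0 and 1 span R^2, so Phi^T M Phi is invertible.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def seminorm(y):
    """Weighted Euclidean seminorm with weights diag(M)."""
    return np.sqrt(np.sum(m * y**2))

# Assumed matrix form of the projection: Phi (Phi^T M Phi)^{-1} Phi^T M.
Pi = Phi @ np.linalg.solve(Phi.T @ M @ Phi, Phi.T @ M)

y = np.array([1.0, -2.0, 5.0])
proj = Pi @ y

# Compare against random points of E = column space of Phi.
rng = np.random.default_rng(0)
gaps = [
    seminorm(y - Phi @ rng.normal(size=2)) - seminorm(y - proj)
    for _ in range(100)
]

print("projection of y:", proj)
print("smallest gap to random points of E:", min(gaps))
```

Because state 2 carries no weight, the component of `y` at that state does not influence the projection; its projected value is determined entirely by the features of the positively weighted states.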

Remark C.2 (Equivalent conditions in terms of features) The condition (C.3) can be paraphrased in terms of the features $\phi(s)$ as follows:

$$\forall\, s \in \mathcal{S} \text{ with } M_{ss} = 0, \quad \phi(s) \in \mathrm{span}\{ \phi(\bar{s}) \mid \bar{s} \in \mathcal{S} \text{ and } M_{\bar{s}\bar{s}} > 0 \}; \tag{C.11}$$

namely, from those states with positive emphasis weights $M_{ss} > 0$, $n$ linearly independent feature vectors can be found. Similarly, the condition (C.4) can be paraphrased as:

$$\forall\, s \in \mathcal{S} \text{ with } i(s) = 0, \quad \phi(s) \in \mathrm{span}\{ \phi(\bar{s}) \mid \bar{s} \in \mathcal{S} \text{ and } i(\bar{s}) > 0 \}; \tag{C.12}$$

namely, from the states with positive interest weights, $n$ linearly independent feature vectors can be found. This shows that even without knowing $M$, by designing a rich enough set of features for states of interest beforehand, we can ensure the sufficient condition (C.4) for the nonsingularity and negative definiteness of the matrix $C$.
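
The span conditions (C.11) and (C.12) are straightforward to test numerically through a rank comparison: every feature row lies in the span of the positively weighted rows exactly when restricting to those rows does not reduce the rank. The helper `span_condition_holds` below is a hypothetical illustration of this check, not part of the paper; the weights can be the emphasis weights $M_{ss}$ or the interest weights $i(s)$.

```python
import numpy as np

def span_condition_holds(Phi, weights, tol=1e-12):
    """Check that every row of Phi lies in the span of the rows whose
    weight exceeds tol (cf. conditions (C.11) and (C.12))."""
    Phi = np.asarray(Phi, dtype=float)
    pos = np.asarray(weights, dtype=float) > tol
    if not pos.any():
        # No positively weighted state: the span is {0}.
        return bool(np.allclose(Phi, 0.0))
    # Rows outside `pos` add nothing to the row space iff the ranks agree.
    return bool(np.linalg.matrix_rank(Phi[pos]) == np.linalg.matrix_rank(Phi))

weights = [3.0, 3.5, 0.0]   # hypothetical weights; state 2 has zero weight
Phi_good = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Phi_bad = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(span_condition_holds(Phi_good, weights))
print(span_condition_holds(Phi_bad, weights))
```

The first feature matrix passes because the two positively weighted states already provide two linearly independent feature vectors; the second fails because the feature of the zero-weight state cannot be written in terms of the others.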

Conditions like (C.11), (C.12) [or equivalently, (C.3), (C.4)] are naturally satisfied in the case where the approximate values of the policy $\pi$ at certain states (e.g., those states $s$ with $M_{ss} = 0$ or $i(s) = 0$) are interpolated or extrapolated from the approximate values of $\pi$ at some other “representative” states, based on the “proximity” of the former states to the representative ones.